Step 1: Ask

Business Task

This analysis was initiated by Urška Sršen, co-founder of Bellabeat. Sršen knows that an analysis of available data on consumers’ smart device usage would reveal opportunities for the company to grow, and has provided the following business task:

Analyse smart device usage data to gain insight into how people are already using smart devices, then generate high-level recommendations for how these insights can inform the marketing strategy for one Bellabeat product.

For this analysis, I chose to produce recommendations for the Bellabeat Ivy. At the time of writing this was Bellabeat’s most fully-featured product, and the one that most closely aligned with the functionality expected from existing wearable smart devices, so I considered it the product to which my available data would be most directly applicable.

Key Stakeholders

  • Urška Sršen
    • Co-founder of Bellabeat
    • Initiator of this analysis
  • Bellabeat marketing team
    • Intended audience for my presentation
    • Will use my insights to guide marketing strategies

Step 2: Prepare

Selecting my tools

For this analysis I wanted a tool or set of tools with the following features:

  • Sufficient power to handle large data sets, e.g. FitBit data tables with >1M observations
  • Functions for data manipulation, e.g. loading, cleaning and combining data sets
  • Functions for data analysis, e.g. regression analysis and statistical analysis
  • Functions for data visualisation, e.g. plotting
  • Methods for storing the details of my analysis methodology separate from the data itself, e.g. separate source code files, as opposed to macros stored within spreadsheet files
  • Methods for generating reports from my analysis with as little repetition of work as possible, e.g. inline markdown languages, or Page Layout views in spreadsheet applications
  • A straightforward learning process and user interface, e.g. spreadsheet tools typically have a single “Add Chart” tool under which all of their powerful charting options can be found; contrast this with learning to download, enable, and finally use the ggplot2 R library

With these requirements in mind, I considered three tools: R, spreadsheets, and databases.

Feature                        R    Spreadsheets  Databases
Power for large data sets      Yes  No            Yes
Data manipulation tools        Yes  Yes           Yes
Data analysis tools            Yes  Yes           No
Data visualisation tools       Yes  Yes           No
Separate analysis files        Yes  No            Yes
Streamlined report generation  Yes  No            No
Straightforward to learn       No   Yes           No

Clearly, R is the best single tool for this analysis. R lacks the straightforward operation of spreadsheet tools, and I will need to learn libraries and programming techniques as I do my analysis, but this is acceptable given my prior experience with other programming languages like Python.

Note: Python itself was not considered for this analysis: while it shares most of the features, advantages, and disadvantages of R, I’m already familiar with Python and wanted to use this case study to familiarise myself with R instead.

Setting up my RStudio environment

I set up my RStudio environment using the following packages:

  • dplyr
  • anytime
  • tidyverse
  • janitor
  • readr
  • skimr
  • tibble
  • ggplot2
  • gridExtra
  • ggpubr
  • kableExtra
  • purrr
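
A sketch of that setup, assuming none of the packages are yet installed (dplyr, readr, tibble, ggplot2, and purrr all load as part of the tidyverse, so they don't need separate library() calls):

```r
# Install any missing packages (once), then load them all
pkgs <- c("tidyverse", "anytime", "janitor", "skimr",
          "gridExtra", "ggpubr", "kableExtra")

missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)

# Attach every package quietly
invisible(lapply(pkgs, library, character.only = TRUE))
```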

Getting the data

For this data analysis, I’ll be making use of one public data set specified by Sršen: the FitBit Fitness Tracker Data Set, available under the CC0 License via Kaggle user Mobius.

For the initial analysis, the data set was downloaded in its entirety from Kaggle and stored locally on my computer. This provided a baseline for analysis, with any modifications to file names, folder re-structuring, or removal of unnecessary data tables to be conducted after I’m familiar with the raw data. The CSV files in the data set were loaded directly into the environment as data frames.
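
Loading every CSV into a named list of data frames can be done in one pass; the folder path below is a placeholder for wherever the Kaggle download was unzipped:

```r
library(tidyverse)

# Hypothetical local path to the unzipped Kaggle download
data_dir <- "data/fitbit"

csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# One data frame per file, named after the file minus its extension
fitbit_data <- csv_files %>%
  set_names(~ tools::file_path_sans_ext(basename(.))) %>%
  map(read_csv, show_col_types = FALSE)
```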

Understanding the data

Overview

The database contains data on the following features of the FitBit devices:

Feature                  Units                   Sampling Rate
BMI                      BMI                     Manual/automatic logging
Body Weight              kg                      Manual/automatic logging
Body Fat                 %                       Manual/automatic logging
Calorie burn             Calories                1 minute
Distance                 Unknown                 1 minute
Heart Rate               Beats per Minute (BPM)  5 seconds
Intensity (of exercise)  Factor, 0 to 3          1 minute
METs (during exercise)   METs                    1 minute
Sleep                    Factor, 1 to 3          1 minute
Steps                    Steps                   1 minute

Issues

Inconsistent and misleading file names

The file names in the data set present the following issues:

  • It’s unclear if data is source or summarized, e.g. “minuteCaloriesNarrow_merged.csv” is source data, while “dailyCalories_merged.csv” is summarized data
  • Summarized data is not always tied to specific features, e.g. “dailyActivities_merged.csv” contains data summarized from multiple features: there is no “Activity” feature with an associated source data set.
  • Data shape is unclear:
    • “minuteCaloriesNarrow_merged.csv” and “dailyCalories_merged.csv” are both tall (or “Narrow”) data
    • “minuteIntensitiesWide_merged.csv” and “dailyIntensities_merged.csv” are both wide data
  • Data sampling rate is inconsistently formatted
    • Compare “heartrate_seconds_merged.csv” to “daySleep_merged.csv”
  • Sample rate usually comes before feature name in the file name, affecting file sort operations
  • All files are suffixed with “_merged”, so the suffix carries no distinguishing information and can be dropped

To correct each of these issues, I’ll rename the files using this naming convention:

[feature]_[src/sum]_[interval]_[shape].[filetype]

For example, “sleepDay_merged.csv” will be renamed to “sleep_sum_days_wide.csv”. This format removes the ambiguities in the original file names, and ensures all required information on the contents of the data is included.

Inconsistent variable names

All variables in the data set will be changed from CapitalisedCase to snake_case, by convention and to address an issue with inconsistently-capitalised variables (e.g. logId/logid).
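
janitor’s clean_names() handles this conversion, including the inconsistently-capitalised variables; a minimal sketch with an illustrative data frame:

```r
library(janitor)

# Illustrative frame with the original CapitalisedCase names
df <- data.frame(Id = 1, ActivityDay = "4/12/2016", LogId = 2, TotalSteps = 100)

# Default case is snake_case
df <- clean_names(df)

names(df)
# "id" "activity_day" "log_id" "total_steps"
```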

Inappropriate variable types

Some of the tables in the data set use variables of a type unsuitable for analysis. These variables and their required modifications are tabulated below:

Variable        Original Type  Required Type  Issue
ActivityDay     chr            datetime       Cannot perform datetime operations on chr variables
ActivityHour    chr            datetime       Cannot perform datetime operations on chr variables
ActivityMinute  chr            datetime       Cannot perform datetime operations on chr variables
Date            chr            datetime       Cannot perform datetime operations on chr variables
Id              num            chr            Disable scientific notation and numerical operations (IDs are not a numeric value)
LogId           num            chr            Disable scientific notation and numerical operations (IDs are not a numeric value)
SleepDay        chr            datetime       Cannot perform datetime operations on chr variables
Time            chr            datetime       Cannot perform datetime operations on chr variables

Duplicate data within tables

The “bodycomp_src_logs_wide.csv” file contains both kilogram and pound variables: these describe the same information, and all of the observations contain values for both variables, so one variable can be dropped with no loss of data. The choice between the two formats seems arbitrary for my analysis, so I’m keeping the kilogram data, as it’s expressed in an SI unit.

“intensity_sum_hours_wide.csv” contains both Total Intensity and Average Intensity. Total Intensity values exceed 3, the maximum Intensity level, so they are per-hour sums of the per-minute Intensity values rather than Intensity levels themselves. Average Intensity is simply the Total Intensity for each hour divided by 60 minutes per hour, so one of the two variables can be dropped with no loss of data.

Missing context for numeric variables

All of the numeric variables in the data set have clearly defined units except for “Distance”. Distance does not appear to have a source data table: it’s only included in the “activity_sum_days_wide” and “intensity_sum_days_wide” tables, and only as summary data grouped by Intensity level. Daily values between 0 and 1 are present, so it seems reasonable to assume this is either kilometers or miles, as opposed to meters or feet. Miles and kilometers represent the same information on slightly different scales, so the analysis should not be affected as long as the units are consistent across tables, which they appear to be. I will therefore assume the distances are given in kilometers for this analysis.

Missing context for factor variables

Two variables in the data set, “Intensity” (for exercise) and “Value” (for sleep), are numerical factors with no defined range. For these factors, I couldn’t tell from the data alone whether all possible values that a FitBit can record are present, or what real-world states the numeric values represented.

Analysis of the factor variables confirmed they followed the mappings listed below. Full details of the analysis can be found in Appendix A.

Exercise Intensity:

  • 0: Sedentary
  • 1: Lightly Active
  • 2: Fairly Active
  • 3: Very Active

Sleep Value:

  • 1: Asleep
  • 2: Restless
  • 3: Awake

Step 3: Clean

Data Cleaning Checklist

The following broad cleaning steps were taken:

  • Updated CSV file names to new standard
  • Removed missing data
  • Cleaned and updated variable names
  • Removed duplicate data
  • Recast mismatched variable data types
  • Trimmed leading or trailing characters
  • Validated numeric data

These steps are further detailed in the sections below.

Update CSV file names to new standard

For this analysis, I renamed the files according to the following convention:

[feature]_[src/sum]_[interval]_[shape].[filetype]

Applying the naming convention to the data set yielded the following file names:

Original                            Updated
dailyActivity_merged.csv            activity_sum_days_wide.csv
dailyCalories_merged.csv            calories_sum_days_tall.csv
dailyIntensities_merged.csv         intensity_sum_days_wide.csv
dailySteps_merged.csv               steps_sum_days_tall.csv
heartrate_seconds_merged.csv        heartrate_src_seconds_tall.csv
hourlyCalories_merged.csv           calories_sum_hours_tall.csv
hourlyIntensities_merged.csv        intensity_sum_hours_wide.csv
hourlySteps_merged.csv              steps_sum_hours_tall.csv
minuteCaloriesNarrow_merged.csv     calories_src_mins_tall.csv
minuteCaloriesWide_merged.csv       calories_sum_mins_wide.csv
minuteIntensitiesNarrow_merged.csv  intensity_src_mins_tall.csv
minuteIntensitiesWide_merged.csv    intensity_sum_mins_wide.csv
minuteMETsNarrow_merged.csv         mets_src_mins_tall.csv
minuteSleep_merged.csv              sleep_src_mins_tall.csv
minuteStepsNarrow_merged.csv        steps_src_mins_tall.csv
minuteStepsWide_merged.csv          steps_sum_mins_wide.csv
sleepDay_merged.csv                 sleep_sum_days_wide.csv
weightLogInfo_merged.csv            bodycomp_src_logs_wide.csv

The conversion was performed manually on my local device.
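
The same conversion could equally be scripted against a lookup table of old and new names; a sketch, with a truncated illustrative mapping and a placeholder path:

```r
# Illustrative two-row mapping; the full table above has 18 rows
renames <- data.frame(
  old = c("dailyActivity_merged.csv", "sleepDay_merged.csv"),
  new = c("activity_sum_days_wide.csv", "sleep_sum_days_wide.csv")
)

data_dir <- "data/fitbit"   # hypothetical local path

# Rename each file in place on disk
file.rename(file.path(data_dir, renames$old),
            file.path(data_dir, renames$new))
```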

Remove missing data

Each data frame was checked for NULL values and non-null empty strings.
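
This check can be sketched as a small helper run over each data frame; the demonstration frame below is illustrative:

```r
# Count NA values plus empty strings in every column of a data frame
count_missing <- function(df) {
  sapply(df, function(col) {
    sum(is.na(col)) + if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
  })
}

demo <- data.frame(weight_kg = c(62.5, NA), fat = c("", "22"))
count_missing(demo)
# weight_kg       fat
#         1         1
```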

The only table with missing data was “bodycomp_src_logs_wide”, which was missing all but two of the data points for Body Fat Percentage. Given this lack of data, Body Fat Percentage was excluded from the data analysis.

Clean and update variable names

Variables were cleaned and renamed where required according to the following specification:

Original Name                           Updated Name
activity_date                           activity_day
date (bodycomp_src_logs_wide.csv)       bodycomp_datetime
date (sleep_src_mins_tall.csv)          sleep_minute
log_id (sleep_src_mins_tall.csv)        sleep_log_id
log_id (bodycomp_src_logs_wide.csv)     bodycomp_log_id
fairly_active_distance                  distance_fairly_active
fairly_active_minutes                   minutes_fairly_active
light_active_distance                   distance_lightly_active
light_active_minutes                    minutes_lightly_active
lightly_active_distance                 distance_lightly_active
lightly_active_minutes                  minutes_lightly_active
logged_activities_distance              distance_logged_activities
me_ts                                   mets
moderately_active_distance              distance_moderately_active
sedentary_active_distance               distance_sedentary
sedentary_distance                      distance_sedentary
sedentary_active_minutes                minutes_sedentary
sedentary_minutes                       minutes_sedentary
step_total                              steps_total
time (heartrate_src_seconds_tall.csv)   heart_rate_second
total_distance                          distance_total
total_intensity                         intensity_total
total_minutes_asleep                    minutes_asleep_total
total_sleep_records                     sleep_records_total
total_steps                             steps_total
total_time_in_bed                       minutes_in_bed_total
tracker_distance                        distance_tracker
value (heartrate_src_seconds_tall.csv)  heart_rate
value (sleep_src_mins_tall.csv)         sleep_rank
very_active_distance                    distance_very_active
very_active_minutes                     minutes_very_active

Remove duplicate data

Duplicate rows were removed from all data frames using the anyDuplicated and distinct functions.
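
A sketch of that step, using anyDuplicated() as a cheap pre-check before dplyr::distinct() does the removal (the data frame is illustrative):

```r
library(dplyr)

# Illustrative frame with one fully duplicated row
df <- data.frame(id = c(1, 1, 2), steps = c(100, 100, 250))

# Only bother deduplicating frames that actually contain duplicates
if (anyDuplicated(df) > 0) {
  df <- distinct(df)
}

nrow(df)  # 2
```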

Recast mismatched variable data types

Variables were recast according to the following specification:

Variable         Original Type  Updated Type  Reason
activity_day     chr            datetime      Cannot perform datetime operations on chr variables
activity_hour    chr            datetime      Cannot perform datetime operations on chr variables
activity_minute  chr            datetime      Cannot perform datetime operations on chr variables
date             chr            datetime      Cannot perform datetime operations on chr variables
id               num            chr           Disable numerical operations (IDs are considered a UID string)
log_id           num            chr           Disable numerical operations (IDs are considered a UID string)
time             chr            datetime      Cannot perform datetime operations on chr variables
sleep_rank       num            factor 1:3    Disable numerical operations (value is a ranking, not an amount)
intensity        num            factor 0:3    Disable numerical operations (value is a ranking, not an amount)

Given the large number of variables in the data set, the recasting procedure includes test code to confirm that the updated variable types match the types specified in the variable modification list. The test code works by first generating a list of the desired final variable/type pairs, then, once the conversion is complete, generating a second list of the actual variable/type pairs in the data frames for comparison. This not only confirms the desired conversions took place, but also checks for unintended changes to variables that did not require conversion.

Building the test code added a fair amount of work to the project, but now that it’s working, it can be reused and scaled for future projects.
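
A minimal version of that check: build the expected name/type pairs up front, then compare them against the actual column types after conversion (the expected types and data frame here are illustrative):

```r
# Expected name/type pairs after recasting (illustrative subset)
expected <- c(id = "character", activity_day = "POSIXct", calories = "numeric")

df <- data.frame(id = "1503960366",
                 activity_day = as.POSIXct("2016-04-12", tz = "UTC"),
                 calories = 1985)

# Actual primary class of each column, in the same order as the expectations
actual <- sapply(df[names(expected)], function(col) class(col)[1])

all(actual == expected)  # TRUE only if every column matches its expected type
```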

Trim leading or trailing characters

Leading and trailing whitespace characters were removed from all variable names and all character-type data fields.
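
This can be sketched with base R’s trimws(), applied to both the names and the character columns of a data frame (the demonstration frame is illustrative):

```r
# Trim leading/trailing whitespace from variable names and character fields
trim_df <- function(df) {
  names(df) <- trimws(names(df))
  df[] <- lapply(df, function(col) if (is.character(col)) trimws(col) else col)
  df
}

demo <- data.frame(" id " = c(" a1 ", "b2"), check.names = FALSE)
names(trim_df(demo))   # "id"
trim_df(demo)$id       # "a1" "b2"
```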

Validate numeric data

Given that the data was not entered manually by the user, there was no way for me to manually check the correctness of all of the numeric values included in the data set. The values were instead checked against pre-determined limits to verify that they fell within realistic ranges, as detailed below.

Validating the data in this way also helps confirm the data makes sense in terms of the business logic, by confirming the data falls within realistic ranges given the capabilities of the devices and the types of data they claim to track.

ID lengths and value ranges are correct

All ID values in the data set were checked against their correct length according to the following specification:

ID Variable      Correct Length (Digits)
id               10
sleep_log_id     11
bodycomp_log_id  13
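
Since the IDs were recast to character, the length check reduces to a string-length comparison; a sketch with illustrative id values:

```r
# Expected digit counts per ID variable, from the table above
expected_len <- c(id = 10, sleep_log_id = 11, bodycomp_log_id = 13)

ids <- c("1503960366", "1644430081")     # illustrative id values
all(nchar(ids) == expected_len[["id"]])  # TRUE when every id is 10 digits long
```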

The validation confirmed all values were the correct length.

To further validate the large number of ID data points in the data set, I checked the minimum and maximum ID values in each data frame. The intent was to identify any possible erroneous values within the acceptable length limits, e.g. all zeroes or all nines.

The validation confirmed all ID values were within a realistic range.

Dates are all within range

The data set was described as containing data collected from users between “03.12.2016-05.12.2016”: all records in the data set were validated against this date range. An initial check found dates just outside the range, extending to 8am on 05.13.2016, the day after the data set supposedly ended. I didn’t consider this to be a problem, so the valid end date was extended to the 14th of May 2016, covering the whole of the 13th.
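
A sketch of the range check, with the end of the window extended to cover the trailing 13 May records (the timestamps below are illustrative):

```r
# Valid window: start of the data set up to (but not including) 14 May 2016
range_start <- as.POSIXct("2016-03-12 00:00:00", tz = "UTC")
range_end   <- as.POSIXct("2016-05-14 00:00:00", tz = "UTC")

stamps <- as.POSIXct(c("2016-04-12 08:00:00", "2016-05-13 08:00:00"), tz = "UTC")

all(stamps >= range_start & stamps < range_end)  # TRUE: both inside the window
```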

The validation confirmed all date values were within the target range.

Non-date values are within appropriate ranges for their units

Validation Checklist

  • All numeric values are non-negative
  • Percentage values are less than 100
  • Weight values are positive and make sense (e.g. less than 200kg)
  • BMI values are in range
  • Daily, hourly, and minute duration sums are no more than one day, hour, or minute, respectively
  • Distances make sense
  • Step counts make sense (check for >20000 for a start)
  • Calories are within normal range
  • Heart rates are less than 200

The validation code was split into generic checks that could be run on all numeric values, followed by code to analyse specific variables.
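
The split can be sketched as a generic negative-value check run over every numeric column, plus a reusable range check for variable-specific limits; the limits and demonstration frame below are illustrative values from the checklist:

```r
# Generic check: count negative values in every numeric column
count_negatives <- function(df) {
  sapply(Filter(is.numeric, df), function(col) sum(col < 0, na.rm = TRUE))
}

# Variable-specific check: count values outside a [lo, hi] range
out_of_range <- function(x, lo, hi) sum(x < lo | x > hi, na.rm = TRUE)

demo <- data.frame(heart_rate = c(60, 185, 203), weight_kg = c(70, 82, 95))

count_negatives(demo)                   # no negative values in either column
out_of_range(demo$heart_rate, 20, 200)  # 1 record above 200 bpm
```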

Validation Results

  • There were no invalid negative values
  • There were no invalid percentage values
  • There were no invalid minute summations
  • There were 19 step count data points above 20,000: these were manually checked and found to be associated with long distances covered and high levels of exercise, which made sense, so I considered that data valid.
  • There were a total of four unique data points with a distance covered of more than one half-marathon: all of these also indicated a very high level of exercise for at least 90 minutes, which made sense, so I considered the data valid
  • Calories: 21 records showed caloric burns of over 4000 calories. All but one of those were associated with very high levels of activity: the outlier showed only 30 minutes total as “Very Active” or “Lightly Active”, compared to 120+ minutes for all other records. It does show 15km distance covered that day, which is nearly a third of a marathon, so for now it makes enough sense to keep it in, and I will analyse it further after the cleaning stage.
  • Heart-rate: 13 records were found with a heart-rate between 200 and 203, all of which corresponded to a single user over a 30-minute period of “Very Active” intensity.

Sources used to determine realistic limits for numerical values


Step 4: Analyse

Method

There were a range of features to consider when comparing the FitBit devices in the dataset to the Bellabeat Ivy. Of these, I chose to focus this analysis on the features that were both available on the Ivy and available in the data set, either directly or via suitable proxy data. This approach allowed me to focus my recommendations to the marketing team on features that can actually be marketed to the public in the first place.

Heart Rate Tracking and Activity Tracking

The existing marketing material for the Ivy states that Heart-rate Tracking can be “used to track your workout progress and optimize personal training routines,” and that the Exercise Tracking feature “will recognize your activity during the day, help you track up to 80 types of activity, count your steps, and discover how all that affects your body.” These features are closely related, since one of the primary functions of the HR tracking is to detect exercise intensity; steps and distance tracking can also be used to further analyse users’ activity.

In order to better understand how people are using similar functions on their FitBits, I analysed the quantities and qualities of the exercise that users in the data set were logging.

How much time do people spend exercising?

I started by getting a high-level breakdown of the time users were spending at different intensity levels:

Preliminary Findings

  • The large majority of people’s time was spent in a sedentary state.
  • This made sense, since it included sleep tracking as well as time spent awake but immobile
  • There was a wide spread of times spent non-sedentary, with no clear outliers
  • The large majority of non-sedentary time was Lightly Active

Given that such a large majority of non-Sedentary time is “Lightly Active”, and based on my understanding of the Intensity zones from previous sections, I assumed “Lightly Active” time included any activity more intense than sitting down, up to and including activities like walking for leisure. Anything more intense therefore fell into the “Fairly Active” or “Very Active” categories. With this in mind, I proceeded with the following definition:

“Active Time” is time logged in either the “Fairly Active” or “Very Active” intensity category, and is assumed to be time spent intentionally exercising.

Using this definition, I analysed users’ time spent intentionally exercising, i.e. their Active Time:
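
The Active Time calculation can be sketched as follows; the column names follow the renaming specification above, and the data frame itself is illustrative:

```r
library(dplyr)

# Illustrative daily summary rows (real data: activity_sum_days_wide)
daily <- data.frame(
  id = c("a", "a", "b"),
  minutes_fairly_active = c(10, 0, 30),
  minutes_very_active   = c(20, 5, 60)
)

# Mean daily Active Time per user, in hours
active_time <- daily %>%
  mutate(active_minutes = minutes_fairly_active + minutes_very_active) %>%
  group_by(id) %>%
  summarise(mean_active_hours = mean(active_minutes) / 60)
```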

Preliminary Findings

  • 72.7% of the cohort (24 users) got less than 45 minutes of Active Time per week on average.
  • 18% of the cohort (6 users) did more than 1 hour a week on average.

How intensely do people exercise?

Here I analysed the data to see if there was any variation in the level of intensity of exercise between users. To do this I calculated each user’s Very Active Time as a proportion of their total Active Time: this gave me a single measure of how intensely each individual user was typically exercising:
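
A sketch of that per-user calculation, on an illustrative data frame with columns named per the renaming specification:

```r
library(dplyr)

daily <- data.frame(
  id = c("a", "a", "b"),
  minutes_fairly_active = c(10, 0, 30),
  minutes_very_active   = c(20, 5, 60)
)

# Each user's Very Active minutes as a share of their total Active minutes
intensity <- daily %>%
  group_by(id) %>%
  summarise(very_active_prop = sum(minutes_very_active) /
              sum(minutes_very_active + minutes_fairly_active))
```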

Preliminary Findings

  • Exercise intensity was evenly spread out from around 10% Very Active to around 90% Very Active across all users, with no clear outliers.
  • The user group appeared to engage in a range of exercise activities of varying intensity.

Do people who exercise more do more-intense exercise?

The previous analysis showed that users were engaging in a range of intensity levels. I also wanted to identify if there was a trend in the cohort where users who exercised more often also exercised more intensely. This would help me characterise the user group: perhaps users who exercised harder tended to exercise longer; perhaps those who exercised longer were favouring lower-intensity activities, like walking; or perhaps there would be no pattern, indicating a variety of exercise preferences. To analyse this, I plotted the Very Active Proportion data from the previous analysis against total Active Time for each user:

Very Active Proportion was positively correlated with Active Time, with an R^2 value of 0.27. On visual inspection of the plot, the cohort appeared split into a majority group clustered around the lower end of Active Hours, and a smaller group with higher hours. Intuitively, this seemed to be skewing the scatter plot into the upper-right corner, and possibly contributing to a stronger correlation between hours and intensity. To put some numbers behind this, I re-drew the plot with the upper quintile of users removed:

With the upper quintile of the cohort removed, the correlation between hours and intensity dropped from 0.27 to 0.097. This supported my hypothesis that a small group of users was particularly active, both in terms of quantity and intensity of exercise, while the remaining majority was spread out over a wider range.

Preliminary Findings

  • Very Active Proportion was positively correlated with Active time, with an R^2 value of 0.27.
  • This correlation was driven primarily by a minority of users with high Active Time and high Very Active Proportion.
  • Overall, the results indicated that the majority of the cohort was engaged in a wide range of activities. There did not appear to be any evidence that a majority of users favoured a particular level of intensity.

Do people increase their activity over time?

One potential selling point for any health-oriented wearable device, like the FitBit, is the idea that it actively encourages users to be more active. I wanted to see if this idea was borne out in the data: I knew that most users did not exercise much, on average, but did those values increase at all? If they did, there would be grounds to claim that device usage is at least associated with an increase in users’ physical activity.

To analyse this, I calculated the correlation coefficients between time and exercise hours, at each intensity level and for each user, which condensed the full month’s worth of data for each user into a single variable indicating their overall trend. These coefficients were then plotted as a histogram, allowing me to visualise the trends for all users at once:
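
That condensing step can be sketched as one correlation coefficient per user; the user-day data below is illustrative, with user "a" constructed to trend steadily downwards:

```r
library(dplyr)

# Illustrative user-day data: active minutes per day for two users
daily <- data.frame(
  id  = rep(c("a", "b"), each = 5),
  day = rep(1:5, times = 2),
  active_minutes = c(50, 45, 40, 35, 30,
                     20, 25, 18, 30, 22)
)

# One correlation coefficient per user: their overall trend over time
trends <- daily %>%
  group_by(id) %>%
  summarise(r = cor(day, active_minutes))

# hist(trends$r) then visualises the whole cohort's trends at once
```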

The distribution of correlation coefficients was skewed slightly negative, peaking around -0.2 to -0.1. There was no indication that most users were increasing their exercise duration over time: in fact, the opposite seemed to be true.

As a final check, I re-ran the analysis, this time analysing the correlation between time and Very Active Proportion:

As with the overall categories, the correlation coefficients were mostly negative, with a mean and median of -0.04 and -0.08, respectively.

Preliminary Findings

  • There was no evidence to suggest device users were becoming more active over time
  • Users were actually less likely to log active hours over time, and Very Active Time as a proportion of all Active Time also typically decreased slightly

Do people prefer to walk or run?

As with overall exercise intensity, I analysed user preferences towards walking or running to help target my recommendations towards the activities users are already engaged in. For this, I made use of the step count source data. This data is logged every minute, so each data point equates to the user’s average speed, expressed in steps-per-minute.

For this analysis I defined several different walking speeds based on steps-per-minute, which were intended to be analogous to the four activity Intensity levels:

Speed Range       Steps per Minute
Not Walking       0
Moderate Walking  0 to 80
Brisk Walking     80 to 110
Running           > 110
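
These ranges can be encoded with base R’s cut(); with its default right-closed intervals, boundary values fall into the lower band, which is an assumption on my part since the ranges above overlap at their boundaries:

```r
# Map per-minute step counts onto the speed ranges defined above
speed_category <- function(steps_per_min) {
  cut(steps_per_min,
      breaks = c(-Inf, 0, 80, 110, Inf),
      labels = c("Not Walking", "Moderate Walking",
                 "Brisk Walking", "Running"))
}

as.character(speed_category(c(0, 40, 95, 130)))
# "Not Walking" "Moderate Walking" "Brisk Walking" "Running"
```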

I also defined a value, “Active Walking”, to be used as an analog to the Active Time value used in the activity intensity analysis:

“Active Walking Time” is time logged in either the “Brisk Walking” or “Running” speed categories, and is assumed to be time spent intentionally walking, jogging, or running for exercise.

I then calculated the daily average times that users spent in each range:

As an alternative visualisation, I also plotted a histogram comparing just the non-immobile categories:

This plot breaks out the visualisation of the less-common walking speeds for easier comparison: it can be seen clearly that the majority of users are getting half an hour or less of Active Walking per day, but more than two hours of Moderate Walking per day.

Next, I analysed the quantity of Active Walking specifically that users were logging:

Finally, I analysed what proportion of Active Walking time that users were spending Running:

Preliminary Findings

  • The majority of users’ time is spent Not Walking
  • The majority of the rest of users’ time is spent Moderate Walking
  • The remaining time is split between Brisk Walking and Running, with a range of values across the cohort
  • Runners are rare, with 30 of 33 users logging 0.25 hours or less per week, and the remaining three outliers logging between 0.5 and 1.25 hours per week
  • The proportion of active walking spent running was clustered below 50%, indicating a preference for walking or jogging over running.

Do people increase their active walking hours over time?

To analyse changes in user movement habits over time, I repeated the analysis of time correlation coefficients I performed on the activity intensity data, but for the walking speed categories I had defined:

Preliminary Findings

  • Most users decreased or maintained their walking times
  • This finding was consistent across all walking types
  • There is no evidence to suggest FitBit users increased their walking times as a result of their FitBit device usage

Are people getting their steps in?

Step counts are commonly used as a measure of overall activity levels. The typical figure provided for a healthy adult is 10,000 steps per day: an example can be found in this Australian Health Survey, which states that “For adults, 10,000 steps is used by researchers worldwide as a reasonable estimate of daily activity by healthy adults.” For this analysis, I first analysed overall daily step counts in the cohort:

Preliminary Findings

  • User step counts were distributed approximately normally around a mean of 7,500 steps per day
  • This was closer to the typically-recommended value of 10,000 steps than I was expecting
  • Combined with the earlier data indicating a very high proportion of users’ time being spent Not Walking, this analysis supported the hypothesis that the cohort does not prioritise physical activity

Do people increase their step counts over time?

Preliminary Findings

  • As with walking types, the majority of users actually logged fewer steps over time
  • The analysis further supported the hypothesis that the cohort does not prioritise physical activity

Do people engage in stationary or mobile exercise?

Examples of stationary exercise include gym-based activities like treadmill running and weight-lifting. Examples of mobile exercise include outdoor activities like jogging, but also indoor activities like squash or, depending on the accuracy of the FitBit distance tracking, gym workouts based on rotation through various equipment and exercises spaced around the venue. Determining if users have a preference for one activity type over another can help guide the marketing for the Ivy.

I analysed stationary versus mobile activity preferences by plotting the data for Active Minutes against Active Distance. The theory was that the more stationary exercise users do, the weaker the correlation between active time and active distance should be, because users are not moving while exercising (i.e. machine-based cardio and resistance training).

Preliminary Findings

  • Active Minutes correlate strongly with Active distance, with R^2 = 0.77.
  • Clearly absent from the plot is a significant number of users with high Active Minutes but low Active Distance
  • Users typically log an additional 3.6 km of distance for every hour of Active time: to achieve this in a gym setting a user would have to follow a typical pattern resembling, for example, two one-minute stationary resistance training sets, followed by one minute moving around the gym at approximately 10.8 kilometers per hour, which seems unrealistic
  • Overall, it would appear that the cohort engages in a variety of mobile exercise activities.

Readiness Score

The “Readiness Score” feature of the Bellabeat Ivy makes use of several other features, including the Resting Heart Rate, Respiratory Rate, and Cardiac Coherence functions. The existing marketing material states that “While you sleep at night and your body is in its calmest state, Ivy measures your resting heart rate, respiratory rate, and cardiac coherence… In the morning, when synced with the Bellabeat app, Ivy calculates your readiness score – how ready you are for the day on a scale from 0 to 100.”

This score is one of several features of the Ivy that was not present on the FitBits in the available data, which made it a potential value-add feature, but also made it difficult to investigate existing user behaviours for. One important aspect that I could analyse was whether or not users currently wear their devices to bed.

Do people sleep with their FitBits on?

Nighttime usage of FitBits was particularly relevant to the Readiness Score, as that function does not work unless the user wears their device to bed. If people were already frequently wearing their devices to bed, then the marketing could leverage this tendency to push the feature as an improvement on existing features. On the other hand, if users were not wearing their devices overnight, this might indicate that users were not interested in existing overnight-monitoring functions, or that there were other factors that made overnight wear unappealing, which marketing teams would need to take into account.

As of April 2016, the following wristband FitBits were available:

Model      Sleep Logging  Heart-rate Tracking  Source
Alta       Yes            No                   User Manual
Blaze      Yes            Yes                  User Manual
Charge     Yes            No                   User Manual
Charge HR  Yes            Yes                  User Manual
Flex 1     Yes            No                   User Manual
Flex 2     Yes            No                   User Manual
Surge      Yes            Yes                  User Manual

Sleep logging is available on all devices, so sleep log data should indicate how the overall cohort uses this feature. Heart-rate logging is limited to later devices, so heart-rate data will only indicate how that subset of the cohort uses that feature. With this in mind, I focused my analysis on the sleep logs.

Sleep Logs Analysis

For each user, I calculated the number of nights where the total sleep time logged was greater than 6 hours. This threshold was chosen as being close enough to the recommended 8 hours of sleep to indicate that the user actually went to bed with their device on, but not so high that it would exclude users who don’t get a full night’s sleep.
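The counting step can be sketched in base R; the data frame and column names below are illustrative placeholders, not the actual names from the data set:

```r
# Count, for each user, the nights where more than 6 hours of sleep were logged.
# "sleep_nights" and its columns are hypothetical stand-ins for the real data.
sleep_nights <- data.frame(
    id          = c(1, 1, 1, 2, 2),
    hours_slept = c(7.2, 5.5, 8.0, 6.5, 3.0)
)

nights_over_6h <- aggregate(
    hours_slept ~ id,
    data = sleep_nights,
    FUN  = function(h) sum(h > 6)
)
names(nights_over_6h)[2] <- "n_nights"
print(nights_over_6h)
```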

Heart-rate Analysis

The sleep logs data gave a good indication of whether users were interested in sleep logging data, but it didn’t directly confirm whether people were wearing their devices to bed. To investigate this, I also analysed the heart-rate data for each user. This was by far the largest and least-processed data set, but it had the advantage of only generating data if the device was physically contacting the user at that time.

The heart rate data took the form of individual polls recording the user’s momentary heart rate. This presented a challenge compared to using the summary data, in that these individual logs had to be processed to determine when the user was wearing their device continuously. Simply counting the number of logs over a given time period would not work, as the time between logs varies even when the device is worn. Polling intervals under 60 seconds, for example, typically fall between 1 and 20 seconds, as shown below:

This meant that even choosing the median value of 5 seconds could result in an inferred “time logged” value that’s out by a factor of 4.

To build out the data, I decided on an approach where the time between logs was used to infer the time logged, but only within an acceptable time difference. If one log follows its predecessor by less than this difference, the time between the two is counted as “Logged” time; if the delay is greater, we assume the device was taken off for that time. I selected 20 seconds as the maximum acceptable gap between polls, based on the boxplot’s “Maximum” value of 17 seconds.
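A minimal sketch of this inference, assuming a vector of poll timestamps; the timestamps here are invented for illustration, while the 20-second cutoff is the value chosen above:

```r
# Infer "Logged" time from heart-rate poll timestamps: gaps of 20 seconds
# or less between consecutive polls count as continuously-worn time.
max_gap_secs <- 20

poll_times <- as.POSIXct(
    c("2016-04-12 00:00:00", "2016-04-12 00:00:05", "2016-04-12 00:00:15",
      "2016-04-12 00:05:00", "2016-04-12 00:05:10"),
    tz = "UTC"
)

gaps <- as.numeric(diff(poll_times), units = "secs")
logged_secs <- sum(gaps[gaps <= max_gap_secs])
print(logged_secs)  # 5 + 10 + 10 = 25; the 285-second gap is treated as device-off
```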

Given the small number of heart-rate users, I also compared the heart-rate results directly to the sleep log results to see if there was any difference in the nights logged for individual IDs:

For those IDs present in both data sets, the results were almost identical, with only three users showing variation equating to 2-3 nights, indicating the method is valid.

Preliminary Findings

  • Out of all users in the data set, sleep logging was rare: 36% (12 users) logged between 0% and 5% of nights, and a majority of 51% (17 users) logged 10% of nights or fewer
  • Out of users with sleep log data, the results were more evenly spread, with 37% (9 users) logging 0% to 15% of nights, and another 42% (10 users) logging 70% to 90% of nights.
  • Given the ubiquity of sleep-tracking functionality amongst FitBit devices, these findings indicated that the majority of users were not making use of this function.
  • Out of all users, 73% (24 users) did not log any heart-rate-based sleep logs.
  • Out of all users, 42% (14 users) logged heart-rate data
  • Out of those users, the results were split, with 50% (7 users) logging sleep on 35% to 90% of nights, and the other 50% logging sleep on 0% to 15% of nights
  • The heart-rate-based analysis produced near-identical results for individual users compared to the sleep-log-based analysis
  • Overall, the heart rate-based results reinforced the findings of the sleep-log analysis for the relevant users.

Sleep Tracking


How many people used the sleep tracking feature?

From the previous analysis for the Readiness Score feature, we know that sleep tracking is not commonly used, with a majority of 51% (17 users) logging 10% of nights or fewer, and the remainder spread out over a range up to 90% of nights.

What quantity of sleep are users getting?

Sleep logs in the data set rank sleep quality as Awake, Restless, or Asleep. To analyse the quantity of sleep in the cohort, I calculated the total time spent Asleep for each user:

Preliminary Findings

  • The results appear realistic: the mean and median are 6.9 and 7.4 hours, respectively, with 79% of users (19 users) getting between 6 and 9 hours of sleep on average

What quality of sleep are users getting?

To analyse the quality of sleep in the cohort, I calculated the time spent Asleep as a percentage of total Sleep Time logged for each user:
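This percentage can be sketched as follows, using the validated mapping from the appendix (sleep value 1 = Asleep); the data frame and its values are illustrative placeholders:

```r
# Share of each user's logged sleep time spent in the "Asleep" state.
# Uses the validated mapping where value 1 = Asleep; names are placeholders.
sleep_mins <- data.frame(
    id    = c(1, 1, 1, 1, 2, 2),
    value = c(1, 1, 2, 1, 1, 3)   # minute-level sleep states
)

pct_asleep <- aggregate(value ~ id, data = sleep_mins,
                        FUN = function(v) 100 * mean(v == 1))
names(pct_asleep)[2] <- "pct_asleep"
print(pct_asleep)
```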

Preliminary Findings

  • The majority of users appear to be getting high-quality sleep, with 88% of users logging between 90% and 95% of time as Asleep
  • Time spent Awake is almost nil, with 88% of users logging less than 5% of their time as Awake. This is likely due to the programming of the FitBit devices, which feature an automatic sleep logging function that kicks in when it detects the user may be asleep, based on movement and heart rate data

Did sleep quantity or quality improve over time?

I analysed sleep quantity and quality over time by applying my earlier time-correlation method, calculating each user’s correlation coefficient with time for “total hours asleep” and “percentage of time asleep”, respectively:
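The per-user correlation step can be sketched in base R; the data frame is an invented example, with a simple day index standing in for the date:

```r
# Per-user Pearson correlation of nightly hours asleep against a day index.
# "nightly" and its columns are hypothetical stand-ins for the real data.
nightly <- data.frame(
    id    = c(1, 1, 1, 2, 2, 2),
    day   = rep(1:3, times = 2),
    hours = c(7.0, 7.5, 8.0, 8.0, 7.0, 6.0)
)

cors <- sapply(split(nightly, nightly$id),
               function(d) cor(d$day, d$hours))
print(cors)  # user 1 trends up (+1), user 2 trends down (-1)
```

A distribution of these per-user coefficients around zero is what indicates no consistent trend across the cohort.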

Preliminary Findings

  • The correlations for sleep quantity and quality formed approximately-normal distributions around zero, indicating no consistent increase for either variable
  • Sleep quantity appeared to decrease slightly, with mean and median correlations of -0.09 and -0.13, respectively
  • Sleep quality was approximately static, with mean and median correlations of -0.04 and -0.02, respectively
  • Overall, there is nothing to indicate that users increased the quantity or quality of their sleep over time
  • The quantity and quality of sleep for this cohort of users was already relatively good, with the majority of users getting between 6-9 hours sleep, and spending upwards of 90% of that time in a stable Asleep state.

Resting Heart Rate and Cardiac Coherence

Although the FitBits in the data set did have heart rate tracking, and some models from the period had a “Breathing Exercise” feature, there was no data for resting heart rate or cardiac coherence specifically. What I could analyse was the data on heart-rate tracking generally, asking questions including:

  • How often did people track their heart-rates, either during the day or during the night?
  • When people did track their heart-rates, what level of activity were they engaged in? I.e. were they only putting their trackers on to exercise, or did they wear them during lower-activity or sedentary periods?

Answering these questions would help me understand whether or not the RHR and CC functions would appeal to the user base: for instance, users who rarely wear their existing heart-rate monitors, or only wear them during intense exercise, might be less likely to care about their longer-term cardiac performance.

How often do people track their heart-rate?

From my previous analysis on heart-rate data, I knew that overnight heart-rate tracking was split, with 50% (7 users) logging sleep on 35% to 90% of nights, and the other 50% logging sleep on 0% to 15% of nights. I adapted this analysis of nighttime heart rate logging to investigate daytime usage:

Preliminary Findings

  • Daytime logging is much more popular than nighttime logging
  • 50% of users (7 users) logged 6+ hours of heart-rate data on between 80% and 95% of days
  • The other 50% of users logged on between 0% and 75% of days

What levels of activity do people typically track?

First, the naive approach: what percentage of each user’s time is spent in each heart-rate zone?

Preliminary Findings

  • The majority of logs for all but one user were in the “Sedentary” category (heart rate < 95 bpm)
  • This supported the hypothesis that users were not only using their heart-rate trackers for exercise
  • There appeared to be ample data for Ivy devices to determine a user’s resting heart rate from the daytime data, even if there was little nighttime data
  • This reliance on daytime data might skew the values up, since users are more likely to be at least somewhat active during the day than while asleep

Features without Data

Several functions and features of the Ivy are not present on the FitBits in the data set. Given the lack of data, it’s not possible to directly analyse existing user behaviours with respect to these features, which include:

  • Respiratory Rate
  • Hydration tracking
  • Mindfulness Minute Tracking
  • Menstrual Cycle
  • Wellness Score

Step 5: Share

In which I present all of my key findings and their supporting visualisations


Most users don’t get much exercise

Visualising how much time individual users spent in each zone shows a clear bias towards sedentary and lightly-active zones:


Visualising the frequency of Active Hours across all users shows a right-skewed distribution, with most users getting little exercise and a small number of outliers logging higher numbers:

For those who do exercise, there was no clear preference for low- or high-intensity activities

By visualising Very Active Time as a percentage of Active Time (the sum of Very Active and Fairly Active time), I found a fairly symmetrical, bimodal distribution, with approximately equal numbers of users favouring low- and high-intensity activity:

More time spent active is only loosely related to more time spent at high intensities, with a wider range of times and percentages at the bottom than the top:

Users get around 7,500 steps per day on average

Walking is much more popular than running:

Exercise intensity did not increase over time

Users do not increase their step count, walking time, or walking intensity over time

Average daily steps decreased slightly over time for most users:

Average time spent walking daily also decreased slightly over time for most users, at all speeds:

Users prefer to engage in mobile exercise

The data shows a strong correlation between time spent exercising and distance covered, which indicates users prefer mobile exercise; if the opposite were true and users spent their active time stationary, we would see more users with high active minutes but low distance.

Sleep logging was not common; among those who did log, usage was spread out

The majority of users log sleep on no nights or only a minority of nights, with a long tail of users logging progressively greater majorities of nights:

Analysis of those users who logged heart-rate data shows a similar trend for overnight heart-rate logging:

Sleep logs and heart-rate logs are comparable for users with data for both:

Device comfort may be a factor

Some features cannot be analysed due to a lack of available data

Several functions and features of the Ivy are not present on the FitBits in the data set. Given the lack of data, it’s not possible to directly analyse existing user behaviours with respect to these features, which include Respiratory Rate, Hydration Tracking, Mindfulness Minute Tracking, Menstrual Cycle, and Wellness Score.

Step 6: Act


Based on the findings in the previous section, I present the following recommendations for the marketing strategy for the Bellabeat Ivy.

Promote overnight device wear as a value-add practice, but highlight device comfort equally

The analysis indicated that the majority of users do not wear their devices overnight. This is a potential problem for marketing, as a number of key features of the device, including the Readiness Score, Resting Heart Rate tracking, Cardiac Coherence, and Sleep Tracking will function in a reduced capacity, or not at all, if the device cannot gather data from the user overnight. In the Help documentation for the Ivy it is explicitly stated that, for the Readiness Score feature, “To fully calibrate your baseline and get more reliable readiness score results you must wear your Ivy for 30 nights.” Marketing materials should therefore highlight these features as beneficial to the user as a result of overnight wear to encourage this practice. This will in turn allow these functions to be positioned as value-add features that do more with the data that is gathered overnight than competing products.

One obvious reason why users may not be wearing their devices overnight is comfort. One of the FitBit manuals referenced in my research makes explicit mention of this, informing the user that “wearing your tracker 24/7 does not allow your skin to breathe”. The design of the Ivy may not be helping matters here: it has a more angular, protruding body than typical “watch”-type wearables, which may negatively impact overnight comfort.

![Bellabeat Ivy on woman’s wrist](https://bellabeat.com/wp-content/uploads/2022/05/RHR-1.jpg)

To mitigate these potential consumer concerns, the marketing should highlight the hypo-allergenic and hygienic aspects of the materials used in the Ivy, positioning these as keeping users comfortable during overnight wear.

Showcase users in a broad range of activities, rather than focusing on athleticism

To appeal to the FitBit market, marketing materials for the Ivy’s exercise tracking features should highlight a broad range of activities ranging from gentle to more intense. The marketing should not target highly-athletic groups, as these represent only a small part of the user base.

Focus on mobile activities over stationary activities

Users were found to prefer mobile activities, with very little evidence of high-intensity, low-distance exercise in the data set. Marketing materials should therefore focus on mobile, outdoor activities like light jogging, walking dogs, and cycling, instead of stationary activities like gym workouts.

Focus on walking over running

The data showed that running was not particularly popular in the user group, so marketing materials can highlight walking over running, perhaps using the additional focus to highlight a range of people and locations.

Avoid implying that device usage will make users exercise more or sleep better

The marketing for the Ivy should avoid any claims that users will increase their activity levels, their step counts, or their sleep quantity or quality as a result of their purchasing the device, as the analysis does not support this claim. The analysis actually showed the opposite effect, with users typically decreasing on these metrics.

The analysis did find that most users were consistently getting a good amount of high-quality sleep, even if this did not improve over time. The Sleep Tracking can therefore be marketed more as a way of tracking and reinforcing the user’s existing good sleep habits than as a way of improving them. If a manually-activated sleep tracking mode exists, this should be highlighted as well, as users may get more-representative sleep logs if they include time spent trying to get to sleep, rather than relying on the device to detect sleep and only then start logging.

Highlight the benefits of daytime Resting Heart Rate data

The analysis found that, while most users do not log heart rate data overnight, most do log heart rate data during the day. This indicates that, while the primary benefit of the Resting Heart Rate (RHR) function comes from recording deep sleep overnight, there is still an angle to promote the function on the basis of daytime usage. Daytime RHR can be promoted as a measure of overall health or even a target in its own right: for example, some users may wish to track a lowering daytime RHR as an indicator of improved cardiac performance during a training regime. These may not be as easy a sell as the overnight RHR, but they have the advantage of aligning better with existing user behaviours.

Appendices

Appendix A: Verifying factor variables

Validating “Exercise Intensity”

As discussed above, the exercise-related “Intensity” variable is a numerical factor with no defined range: the meaning and limits of this variable had to be determined before the analysis could proceed. I started by determining the range of values present in the source Intensity data:

# Display all unique Exercise Intensity values
unique_intensities <- unique(intensity_src_mins_tall$Intensity)
print(unique_intensities)
## [1] 0 1 2 3

The “activity_sum_days_wide” table provided some context clues as to what these four levels might mean: the table summarised Intensity data into four new variables which, based on their names, appeared to be associated with intensity levels as follows:

  • 0: Sedentary
  • 1: Lightly Active
  • 2: Fairly Active
  • 3: Very Active

I confirmed this naming convention by using it to reconstruct the daily summary data in “activity_sum_days_wide”, then comparing the two tables and investigating any rows with non-zero differences:

# Reconstruct the summary intensity data for comparison with the original
activity_sum_days_wide_test <- intensity_src_mins_tall %>%
    mutate(activity_date_floored = floor_date(mdy_hms(ActivityMinute), unit = "days")) %>%
    group_by(Id, activity_date_floored) %>%
    summarize(
        minutes_sedentary      = sum(case_when(Intensity == 0 ~ 1, TRUE ~ 0)),
        minutes_lightly_active = sum(case_when(Intensity == 1 ~ 1, TRUE ~ 0)),
        minutes_fairly_active  = sum(case_when(Intensity == 2 ~ 1, TRUE ~ 0)),
        minutes_very_active    = sum(case_when(Intensity == 3 ~ 1, TRUE ~ 0))
    ) %>%
    mutate(Id_ActivityDate_UID = paste(Id, activity_date_floored, sep = "_")) %>%
    arrange(Id, activity_date_floored, Id_ActivityDate_UID)

# Compare both versions of the data and return any dates with different values
intensity_daily_comp <- activity_sum_days_wide %>%
    mutate(ActivityDate_floored = floor_date(mdy(ActivityDate), unit = "days")) %>%
    mutate(Id_ActivityDate_UID = paste(Id, ActivityDate_floored, sep = "_")) %>%
    with(merge(
        .,
        activity_sum_days_wide_test,
        by = c("Id_ActivityDate_UID"),
        all = TRUE
    )
    ) %>%
    mutate(diff_minutes_sedentary      = minutes_sedentary      - SedentaryMinutes     ) %>%
    mutate(diff_minutes_lightly_active = minutes_lightly_active - LightlyActiveMinutes ) %>%
    mutate(diff_minutes_fairly_active  = minutes_fairly_active  - FairlyActiveMinutes  ) %>%
    mutate(diff_minutes_very_active    = minutes_very_active    - VeryActiveMinutes    ) %>%
    select(
        Id_ActivityDate_UID,
        diff_minutes_sedentary,
        diff_minutes_lightly_active,
        diff_minutes_fairly_active,
        diff_minutes_very_active
    ) %>%
    filter(!(diff_minutes_sedentary == 0 & 
                 diff_minutes_lightly_active == 0 & 
                 diff_minutes_fairly_active == 0 & 
                 diff_minutes_very_active == 0)
    ) %>%
    arrange(Id_ActivityDate_UID)

cat("Printing all rows with non-zero diff values:\n")
glimpse(intensity_daily_comp)
rm(activity_sum_days_wide_test)
## Rows: 465
## Columns: 5
## $ Id_ActivityDate_UID         <chr> "1503960366_2016-04-12", "1503960366_2016-…
## $ diff_minutes_sedentary      <dbl> 346, 407, 442, 400, 679, 320, 377, 364, 38…
## $ diff_minutes_lightly_active <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ diff_minutes_fairly_active  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ diff_minutes_very_active    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

The approach above produced no errors, except for the “sedentary minutes” calculation, which was consistently higher in the reconstructed data.

I concluded that I got the mapping correct, given that:

  1. The values were consistently higher by at least 6 hours: this appeared to be due to my method counting time asleep as “sedentary minutes”, which is reasonable
  2. Reversing the order of the mapping produced entirely wrong results.

That said, I still didn’t know whether these were the categories FitBit devices actually work in, or simply a naming convention the data authors came up with, nor whether all FitBit models worked this way. To address this, I referred to the manuals of the relevant FitBits.

Activity trackers like FitBits detect activity intensity partly by measuring the user’s heart rate while exercising: a higher heart rate corresponds with a higher degree of exertion. As of April 2016, the three latest FitBit models with heart-rate tracking were the Blaze, the Charge HR, and the Surge.

The product manuals for each model confirmed they all break down user activity into four default heart-rate zones:

Product            HR Zone 1       HR Zone 2    HR Zone 3   HR Zone 4
FitBit Blaze       “Out of Zone”   “Fat burn”   “Cardio”    “Peak”
FitBit Charge HR   “Out of Zone”   “Fat burn”   “Cardio”    “Peak”
FitBit Surge       “Out of Zone”   “Fat burn”   “Cardio”    “Peak”

While those weren’t exactly the same terms as used in the data set, they were clearly related - “Out of Zone” equated to “Sedentary”, for example.

All three FitBit manuals also made the same claim that the default zones were “based on American Heart Association recommendations”. Even without validating that claim, this indicated that the reasoning behind each zone was not arbitrary and was consistent across devices, so I concluded that any other FitBit models circa 2016 would map intensity values to activity levels as follows:

  • 0: Sedentary Minutes
  • 1: Lightly Active Minutes
  • 2: Fairly Active Minutes
  • 3: Very Active Minutes
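As an illustration, heart-rate samples could be binned into these four levels with `cut()`. Only the 95 bpm Sedentary cutoff comes from this analysis; the 120 and 150 bpm boundaries below are purely hypothetical placeholders, not FitBit’s actual (personalised) zone limits:

```r
# Illustrative zone classification for heart-rate samples.
# Only the 95 bpm Sedentary cutoff is taken from this analysis; the 120 and
# 150 bpm boundaries are hypothetical, since real zones are personalised.
bpm <- c(60, 100, 130, 170)

zones <- cut(bpm,
             breaks = c(0, 95, 120, 150, Inf),
             labels = c("Sedentary", "Lightly Active",
                        "Fairly Active", "Very Active"),
             right  = FALSE)
print(as.character(zones))
```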

Validating “Sleep Value”

As discussed above, the sleep-related “Value” variable is a numerical factor with no defined range: the meaning and limits of this variable had to be determined before the analysis could proceed. As with exercise intensity, I started by determining the range of values present in the data:

# Display all unique Sleep Value values
unique_sleep_values <- unique(sleep_src_mins_tall$value)
print(unique_sleep_values)
## [1] 3 2 1

The values for sleep quality range from 1 to 3. The summary data tables for sleep quality introduced only two new variable names: “Total Time In Bed” and “Total Minutes Asleep”. There wasn’t a name for each factor level as there was for exercise, so in this case I referred straight back to the manuals:

  • Blaze: tracks “both your time spent asleep and your sleep quality”
  • Charge HR: tracks “the hours you sleep and your movement during the night”
  • Surge: tracks “the hours you sleep and your movement during the night”

Further research in the FitBit help pages on sleep tracking revealed that devices track slightly different data depending on whether they have heart-rate tracking:

  • No HR tracking: Generic sleep quality tracking with “Time spent awake, restless, and asleep” categories
  • HR tracking: Sleep stage tracking with “Light Sleep, Deep Sleep, and REM Sleep” stages

The help pages also singled out the Charge HR and the Surge as the only HR-tracking FitBits without full sleep-stage tracking, leaving the Blaze as the only device from this period with that feature. Blaze aside, motion-based sleep quality tracking appeared to go all the way back to the FitBit One. Given this information, I concluded that any other FitBit models circa 2016 would use the following classification scheme for the sleep data:

  • 1: Awake
  • 2: Restless
  • 3: Asleep

My tests revealed that I had this backwards, and the actual scheme was:

  • 1: Asleep
  • 2: Restless
  • 3: Awake

As with the intensity data, I confirmed this naming convention by using it to reconstruct the daily summary data in “activity_sum_days_wide”, then comparing the two tables and investigating any rows with non-zero differences:

# Reconstruct the summary sleep data for comparison with the original
sleep_sum_days_wide_test <- sleep_src_mins_tall %>%
    # mutate(date_typed = mdy_hms(date)) %>%
    mutate(date_floored = floor_date(mdy_hms(date), unit = "days")) %>%
    # Sum time asleep for each Log ID
    group_by(logId) %>%
    summarize(
        "Id" = min(Id),
        # Associate each Log ID with the latest date recorded under it
        "SleepDay" = max(date_floored),
        "minutes_in_bed"   = n(),
        "minutes_awake"    = sum(value == 3),
        "minutes_restless" = sum(value == 2),
        "minutes_asleep"   = sum(value == 1)) %>%
    # Sum time asleep for each date based on SleepDay
    group_by(Id, SleepDay) %>%
    summarize(
        "TotalSleepRecords_2" = n(),
        "TotalMinutesAsleep_2" = sum(minutes_asleep),
        "TotalTimeInBed_2" = sum(minutes_in_bed),
        "TotalMinutesAwake" = sum(minutes_awake),
        "TotalMinutesRestless" = sum(minutes_restless)
    ) %>%
    mutate("Id_SleepDay_UID" = paste(Id, SleepDay, sep = "_")) %>%
    arrange(Id_SleepDay_UID)

# Compare both versions of the data and return any dates with different values
sleepDay_comp <- sleep_sum_days_wide %>%
    # mutate("SleepDay_typed" = mdy_hms(SleepDay)) %>%
    mutate("SleepDay_floored" = floor_date(mdy_hms(SleepDay), unit = "days")) %>%
    mutate("Id_SleepDay_UID" = paste(Id, SleepDay_floored, sep = "_")) %>%
    arrange(Id_SleepDay_UID) %>%
    with(merge(
        .,
        sleep_sum_days_wide_test,
        by = c("Id_SleepDay_UID"),
        all = TRUE
    )
    ) %>%
    mutate(recordDiff = TotalSleepRecords_2 - TotalSleepRecords) %>%
    mutate(sleepDiff = TotalMinutesAsleep_2 - TotalMinutesAsleep) %>%
    mutate(bedDiff = TotalTimeInBed_2 - TotalTimeInBed) %>%
    select(
        Id_SleepDay_UID,
        recordDiff,
        sleepDiff,
        bedDiff
    ) %>%
    filter(!(recordDiff == 0 & sleepDiff == 0 & bedDiff == 0)) %>%
    arrange(Id_SleepDay_UID)

cat("Printing all rows with non-zero diff values:\n")
glimpse(sleepDay_comp)
## Rows: 16
## Columns: 4
## $ Id_SleepDay_UID <chr> "4388161847_2016-04-27", "4388161847_2016-04-30", "438…
## $ recordDiff      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ sleepDiff       <dbl> 22, 1, 6, 5, 1, 1, 2, 2, 520, 520, 16, 2, 10, 8, 20, 3
## $ bedDiff         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 543, 543, 0, 0, 0, 0, 0, 0

The two 500+ minute values were investigated and found to be the result of double-sampling in the data. The affected day had two duplicate logs: the original summary data presented these as two duplicate rows, each with the accurate ~500-minute value, while my reconstruction presented them as two rows, each with the full ~1,000 minutes summed. All other rows were identical to, or within 22 minutes of, the original, which I considered accurate enough to validate the mapping.

Appendix B: “Prepare” Stage Code

# Set up knitting options with the knitr package
knitr::opts_chunk$set(
    echo = FALSE,
    message = FALSE,
    results = 'hide',
    warning = FALSE
)

# Load all required packages
print("Loading packages...")

rqd_pkgs <- c(
  "dplyr",
  "anytime",
  "tidyverse",
  "janitor",
  "readr",
  "skimr",
  "tibble",
  "ggplot2",
  "gridExtra",
  "ggpubr",
  "kableExtra",
  "purrr"
)

lapply(rqd_pkgs, function(pkg) {
  if(!requireNamespace(pkg, quietly = TRUE)) {
    cran_mirror <- "https://cran.r-project.org"
    install.packages(as.character(pkg), repos = cran_mirror)
  }
  library(pkg, character.only = TRUE)
})
rm(rqd_pkgs)
print("Loading packages complete.")
# Set the working directory of the R markdown environment to match that of the console
dir <- "/Users/nathanweaver/Documents/GitHub/Bellabeat Case Study"
setwd(dir)
cat("Set working directory to:", dir, "\n")
# Load all CSVs in the source directory into the environment
csv_dir <- "Fitabase_Data_Cleaned"
paths_dfs <- list.files(csv_dir, pattern = "*.csv", full.names = TRUE)

df_names <- paths_dfs %>%
  basename() %>%
  tools::file_path_sans_ext() 

for (i in 1:length(df_names)) {
  assign(df_names[i], read_csv(paths_dfs[i]))
}
rm(i, paths_dfs)

prep_apdx_labs <- grep("^prep-", knitr::all_labels(), value = TRUE)
invisible(lapply(prep_apdx_labs, print_chunk_with_label))

Appendix C: “Clean” Stage Code

# Remove all NULL and empty-string values
cat("Checking for NULL/empty values...\n", sep="")
for (df_name in df_names) {
  df <- get(df_name)
  num_nulls <- sum(is.na(df))
  if(num_nulls) {
    cat(df_name,": ",num_nulls," NULLs","\n",sep="")
  }
  # Check for empty strings (which do not show up as NULLs)
  empty_strings <- df %>%
    filter(if_any(where(is.character), ~ nchar(.) == 0))
  num_empty_strings = nrow(empty_strings)
  if(num_empty_strings) {
    cat(df_name,": ",num_empty_strings," empty strings","\n",sep="")
  }
}
cat("Checking for NULL/empty values done.\n", sep="")
rm(df, df_name, num_nulls, empty_strings, num_empty_strings)

# Declare all functions and global variables required for cleaning
rename_df_variables <- function(df_name, var_mods) {
  cat("Renaming variables in \"",df_name,"\"...\n", sep = "")
  df <- get(df_name)
  # Check each var name requiring correction against the var names in the df
  for (i in 1:nrow(var_mods)) {
    var_old = var_mods$var_old[i]
    if (!(var_old %in% colnames(df))) {
      next
    }
    # If found, make sure the conversion is applicable to this or all dfs
    tbl <- var_mods$tbl[i]
    if (tbl != "" && tbl != df_name) {
      next
    }
    # Perform the conversion if all checks passed
    var_new = var_mods$var_new[i]
    cat("df: ",df_name, "\tvar_old: ",var_old,"\t",sep="")
    cat("var_new: ",var_new,"\t", sep="")
    cat("tbl: ",tbl,"    ", sep="")
    cat("Replacing... ", sep = "")
    df <- df %>% rename(!!var_new := !!var_old)
    cat("Done.\n", sep = "")
  }
  rm(i)
  cat("Renaming variables in \"",df_name,"\" complete.\n", sep = "")
  return(df)
}

var_mods <- data.frame(
  var_old = character(0),
  var_new  = character(0),
  type_new = character(0),
  tbl = character(0)
)

# WARNING: Ensure table-specific modifications (tbl != "") are positioned above non-specific modifications with matching var_old/var_new values: only the first modification in the list will be applied to matching variables.
var_mods <- var_mods %>%
  rbind(.,data.frame(var_old="date",                       var_new="bodycomp_datetime",          type_new="POSIXct", tbl="bodycomp_src_logs_wide")) %>%
  rbind(.,data.frame(var_old="time",                       var_new="heart_rate_second",          type_new="POSIXct", tbl="heartrate_src_seconds_tall")) %>%
  rbind(.,data.frame(var_old="value",                      var_new="heart_rate",                 type_new="",        tbl="heartrate_src_seconds_tall")) %>%
  rbind(.,data.frame(var_old="date",                       var_new="sleep_minute",               type_new="POSIXct", tbl="sleep_src_mins_tall")) %>%
  rbind(.,data.frame(var_old="value",                      var_new="sleep_rank",                 type_new="",        tbl="sleep_src_mins_tall")) %>%
  rbind(.,data.frame(var_old="",                           var_new="activity_hour",              type_new="POSIXct", tbl="")) %>%
  rbind(.,data.frame(var_old="",                           var_new="activity_minute",            type_new="POSIXct", tbl="")) %>%
  rbind(.,data.frame(var_old="",                           var_new="sleep_day",                  type_new="POSIXct", tbl="")) %>%
  rbind(.,data.frame(var_old="",                           var_new="id",                         type_new="character", tbl="")) %>%
  rbind(.,data.frame(var_old="log_id",                     var_new="sleep_log_id",               type_new="character", tbl="sleep_src_mins_tall")) %>%
  rbind(.,data.frame(var_old="log_id",                     var_new="bodycomp_log_id",            type_new="character", tbl="bodycomp_src_logs_wide")) %>%
  rbind(.,data.frame(var_old="activity_date",              var_new="activity_day",               type_new="Date",   tbl="")) %>%
  rbind(.,data.frame(var_old="fairly_active_distance",     var_new="distance_fairly_active",     type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="fairly_active_minutes",      var_new="minutes_fairly_active",      type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="light_active_distance",      var_new="distance_lightly_active",    type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="light_active_minutes",       var_new="minutes_lightly_active",     type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="lightly_active_distance",    var_new="distance_lightly_active",    type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="lightly_active_minutes",     var_new="minutes_lightly_active",     type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="logged_activities_distance", var_new="distance_logged_activities", type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="me_ts",                      var_new="mets",                       type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="moderately_active_distance", var_new="distance_moderately_active", type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="sedentary_active_distance",  var_new="distance_sedentary",         type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="sedentary_distance",         var_new="distance_sedentary",         type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="sedentary_active_minutes",   var_new="minutes_sedentary",          type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="sedentary_minutes",          var_new="minutes_sedentary",          type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="step_total",                 var_new="steps_total",                type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_distance",             var_new="distance_total",             type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_intensity",            var_new="intensity_total",            type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_minutes_asleep",       var_new="minutes_asleep_total",       type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_sleep_records",        var_new="sleep_records_total",        type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_steps",                var_new="steps_total",                type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="total_time_in_bed",          var_new="minutes_in_bed_total",       type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="tracker_distance",           var_new="distance_tracker",           type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="very_active_distance",       var_new="distance_very_active",       type_new="",        tbl="")) %>%
  rbind(.,data.frame(var_old="very_active_minutes",        var_new="minutes_very_active",        type_new="",        tbl=""))

# Clean and update all variables in data set
cat("Cleaning variable names...\n", sep = "")
for(df_name in df_names) {
  cat("Cleaning ",df_name,"...\n", sep = "")
  assign(df_name, get(df_name) %>% clean_names())
}
cat("Cleaning variable names complete.\n", sep = "")

cat("Renaming variables...\n", sep = "")
var_mods_rename <- var_mods %>%
  filter(var_old != "" & var_new != "")
for(df_name in df_names) {
  assign(df_name, rename_df_variables(df_name, var_mods_rename))
}
rm(df_name)
cat("Renaming variables complete.\n", sep = "")

# Remove all duplicate data in data set
# This loop was tested using a prototype version that generated data frames of
# all of the apparent duplicate rows: these were verified manually before the
# logic was allowed to modify the actual data frames. The logic was verified
# again by re-running it after the initial removal: all data frames reported
# zero duplicates on the second run, confirming the success of the first pass.
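The prototype described above can be sketched as follows; `report_duplicates` is a hypothetical helper written for illustration, not part of the cleaning script:

```r
library(dplyr)

# Hypothetical sketch of the prototype duplicate report: return every row
# that appears more than once (all copies, not just the repeats), so the
# apparent duplicates can be inspected before distinct() removes them.
report_duplicates <- function(df) {
  df %>% filter(duplicated(df) | duplicated(df, fromLast = TRUE))
}

toy <- tibble(id = c(1, 1, 2), value = c("x", "x", "y"))
dupes <- report_duplicates(toy)  # both copies of the duplicated (1, "x") row
```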
cat("Checking for duplicate values...\n",sep="")
for (df_name in df_names) {
  cat("Checking ",df_name," for duplicates... ",sep="")
  df <- get(df_name)
  if (!anyDuplicated(df)) {
    cat("0 duplicates removed. Done.\n",sep="")
  } else {
    nrow_before <- nrow(df)
    df <- distinct(df)
    nrow_after <- nrow(df)
    cat("Removing ",(nrow_before - nrow_after)," duplicates... ",sep="")
    assign(df_name, df)
    cat("Done.\n",sep="")
  }
}
rm(df, df_name, nrow_before, nrow_after)
cat("Checking for duplicate values complete.\n",sep="")

# Update non-factor variables

# Declare required functions and global variables ----
var_mods_recast <- var_mods %>%
  filter(type_new != "")

are_identical_lists <- function(list1, list2) {
  if (length(list1) != length(list2)) {
    return(FALSE)
  }
  for (i in seq_along(list1)) {
    if(!identical(list1[[i]], list2[[i]])) {
      cat("Non-identical lists at list1[",i,"]. Exiting.\n", sep = "")
      print(list1[[i]])
      print(list2[[i]])
      rm(i)
      return(FALSE)
    }
  }
  rm(i)
  return(TRUE)
}

get_df_var_types <- function(df_name) {
  cat("Getting current variable types for ", df_name, "... ", sep = "")
  df <- get(df_name)
  var_types <- data.frame(
    var = names(df),
    type = sapply(df, function(col) class(col)[1])
  )
  cat("Done.\n", sep = "")
  return(var_types)
}

get_df_target_var_types <- function(df_name) {
  cat("Getting target variable types for ", df_name, "...\n", sep = "")
  var_types <- get_df_var_types(df_name)
  # Iterate over rows in current var_types
  for (var_row in 1:nrow(var_types)) {
    # Check if var name is in var_mods_recast
    var_name <- var_types$var[var_row]
    for(mods_row in 1:nrow(var_mods_recast)) {
      var_name_mods <- var_mods_recast$var_new[mods_row]
      # If no, skip: if yes, replace type with target
      if(var_name == var_name_mods) {
        type_new <- var_mods_recast$type_new[mods_row]
        cat("Updated \"", var_name, "\" from ", var_types$type[var_row], " to ", var_mods_recast$type_new[mods_row], ".\n", sep = "")
        var_types$type[var_row] <- type_new
      }
    }
  }
  cat("Getting target variable types for ", df_name, " done.\n", sep = "")
  return(var_types)
}

recast_variables <- function(df_name) {
  df <- get(df_name)
  for (i in 1:nrow(var_mods_recast)) {
    var_new <- var_mods_recast$var_new[i]
    type_new <- var_mods_recast$type_new[i]
    if (var_new %in% colnames(df)) {
      cat("Converting ",df_name,"$",var_new," to ",type_new, "... ", sep = "")
      if (type_new == "character") {
        df <- df %>% mutate("{var_new}" := as.character(!!sym(var_new)))
      } else if (type_new == "Date") {
        df <- df %>% mutate("{var_new}" := mdy(!!sym(var_new)))
      } else if (type_new == "POSIXct") {
        df <- df %>% mutate("{var_new}" := mdy_hms(!!sym(var_new)))
      } else {
        cat("type_new not found: not converting.", sep = "")
      }
      cat("Done.\n", sep = "")
    }
  }
  rm(i)
  return(df)
}
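As a toy illustration of the injection pattern used in `recast_variables()` (the column name and sample timestamps below are invented): `"{var_new}" := f(!!sym(var_new))` rewrites a column whose name is only known at run time as a string.

```r
library(dplyr)
library(lubridate)

# Toy data: a character timestamp column whose name arrives as a string
var_new <- "logged_at"
toy <- tibble(logged_at = c("4/12/2016 7:21:00 AM", "5/1/2016 11:59:59 PM"))

# Glue syntax on the left names the column; sym() + !! injects it on the right
toy <- toy %>% mutate("{var_new}" := mdy_hms(!!sym(var_new)))
```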

# Recast variables ----

cat("Generating list of target column types for testing...\n", sep = "")
df_types_target <- lapply(df_names, get_df_target_var_types)
names(df_types_target) <- df_names
cat("Generating list of target column types complete.\n", sep = "")

cat("Recasting variables...\n", sep = "")
for (df_name in df_names) {
  cat("Recasting ",df_name,"...\n", sep = "")
  assign(df_name, recast_variables(df_name))
  cat("Converting ",df_name," complete.\n", sep = "")
}
cat("Recasting variables complete.\n", sep = "")

# Test recasting of variables ----

cat("Generating list of updated column types...\n", sep = "")
df_types_after <- lapply(df_names, get_df_var_types)
names(df_types_after) <- df_names
cat("Generating list of updated column types complete.\n", sep = "")

cat("Checking updated column types against target types...\n", sep = "")
test_succeeded <- are_identical_lists(df_types_after, df_types_target)
cat("Checking updated column types against target types complete.\n", sep = "")
cat("Data recasting ", case_when(test_succeeded ~ "succeeded", TRUE ~"failed"), ".", sep = "")
rm(df_name, df_types_target, df_types_after, test_succeeded)

# Recast unique factor variables individually
sleep_src_mins_tall <- sleep_src_mins_tall %>%
    mutate(sleep_rank = factor(sleep_rank, levels = 1:3, labels = c("Asleep", "Restless", "Awake")))

intensity_src_mins_tall <- intensity_src_mins_tall %>%
    mutate(intensity = factor(intensity, levels = 0:3, labels = c("Sedentary", "Lightly Active", "Fairly Active", "Very Active")))

# Trim variable names
cat("Trimming whitespace in variable names...\n", sep="")
for (df_name in df_names) {
  df <- get(df_name)
  for (col_name in colnames(df)) {
    col_name_trimmed <- str_trim(col_name)
    if (col_name_trimmed != col_name) {
      cat("Trimming ",df_name, "[",col_name,"] to ",col_name_trimmed,"... ",sep="")
      # rename() takes new_name = old_name
      df <- df %>% rename(!!col_name_trimmed := !!col_name)
      cat("Done.\n", sep = "")
    }
  }
  assign(df_name, df)
}
rm(df, df_name, col_name, col_name_trimmed)
cat("Trimming whitespace in variable names complete.\n", sep="")

# Trim data
trim_chr_column <- function(col) {
  if (is.character(col)) {
    return(str_trim(col))
  } else {
    return(col)
  }
}
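An equivalent whole-frame form of the column-by-column loop below, on toy data: `across(where(is.character), str_trim)` trims every character column in a single `mutate()` call, leaving other column types untouched.

```r
library(dplyr)
library(stringr)

toy <- tibble(id = c(" 101 ", "102 "), steps = c(10, 20))

# Trim all character columns at once; numeric columns pass through untouched
toy_trimmed <- toy %>% mutate(across(where(is.character), str_trim))
```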

cat("Trimming whitespace in character-type data values...\n", sep="")
for (df_name in df_names) {
  df <- get(df_name)
  for (col_name in colnames(df)) {
    if (is.character(df[[col_name]])) {
      cat("Trimming ",df_name, "[",col_name,"]... ",sep="")
      # df <- df %>% mutate(if_any(where(is.character), trim_chr_column))
      df <- df %>% mutate(across(all_of(col_name), str_trim))
      cat("Done.\n", sep = "")
    }
  }
}
rm(df, df_name, col_name)
cat("Trimming whitespace in character-type data values complete.\n", sep="")

# Validate ID lengths
validate_string_length <- function(df_name, col_name, valid_length) {
  cat("Checking ",df_name," for invalid ",col_name," values... ",sep="")
  df <- get(df_name)
  if (!(col_name %in% colnames(df))) {
    cat("column not found.\n",sep="")
    return(invisible(NULL))
  }
  invalid_values <- df %>%
    select(!!sym(col_name)) %>%
    filter(nchar(!!sym(col_name)) != valid_length)
  if (nrow(invalid_values) > 0) {
    cat(nrow(invalid_values)," invalid values found:\n",sep="")
    glimpse(invalid_values)
  } else {
    cat("complete.\n",sep="")
  }
}

result <- map(df_names, ~validate_string_length(.x, col_name = "id", valid_length = 10))
result <- map(df_names, ~validate_string_length(.x, col_name = "sleep_log_id", valid_length = 11))
result <- map(df_names, ~validate_string_length(.x, col_name = "bodycomp_log_id", valid_length = 13))
rm(result)

# Validate ID ranges
cat("Checking min/max ID value ranges...\n",sep="")
id_col_names <- c("id", "sleep_log_id", "bodycomp_log_id")
for (df_name in df_names) {
  df <- get(df_name)
  for (col_name in id_col_names) {
    if(col_name %in% colnames(df)) {
      cat(df_name,"[",col_name,"] Min = ",min(df[[col_name]])," Max = ",max(df[[col_name]]),".\n",sep="")
    }
  }
}
rm(df, df_name, col_name, id_col_names)
cat("Checking min/max ID value ranges complete.\n",sep="")

# Validate date data
validate_dates <- function(df_name) {
  valid_dates_stt <- as.Date("2016-03-12", format = "%Y-%m-%d")
  valid_dates_end <- as.Date("2016-05-14", format = "%Y-%m-%d")
  
  cat("Validating dates for ",df_name,"... ",sep="")
  
  date_columns <- get(df_name) %>%
    select_if(function(col) is.POSIXct(col) || is.Date(col))
  
  if (ncol(date_columns) == 0) {
    cat("Error: no date column found.\n",sep="")
  } else if (ncol(date_columns) > 1) {
    cat("Error: more than one date column found for ",df_name,".\n",sep="")
  } else {
    outside_range <- date_columns %>%
      filter(if_any(everything(), ~ . < valid_dates_stt | . > valid_dates_end))
    if (nrow(outside_range) > 0) {
      cat("found ",nrow(outside_range)," invalid values:\n",sep="")
      print(outside_range)
      rm(outside_range)
    } else {
      cat("complete.\n")
    }
  }
  rm(date_columns)
}

result <- lapply(df_names, validate_dates)
rm(result)

# Validate all numeric data
validate_numerics <- function(df_name) {
  # This function performs all checks on numeric values that are required
  # in more than one data frame, e.g. non-negativity and summation
  cat("Validating numerics for ",df_name,"... ",sep="")
  numerics <- get(df_name) %>% select_if(is.numeric)
  if (ncol(numerics) == 0) {
    cat("No numeric variables found.\n",sep="")
  } else {
    # Check for negative values
    negative_values <- numerics %>%
      filter(if_any(everything(), ~ . < 0))
    if (nrow(negative_values) > 0) {
      cat("found ",nrow(negative_values)," invalid values:\n",sep="")
      print(negative_values)
    } else {
      cat("complete.\n")
    }
    rm(negative_values)
    # Check for summation
  }
  rm(numerics)
}

result <- lapply(df_names, validate_numerics)
rm(result)

# Perform additional validation on specific numeric data

# Note: The maximum values given are used as thresholds above which values
# may not be realistic, not as hard limits for acceptability: a heart-rate
# of 200bpm, for example, is entirely possible, but it is high enough that
# I would want to check if the data point corresponded to a period of
# high-intensity exercise.
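In that spirit, a minimal sketch of flagging rather than removing: readings over the soft limit are marked for review instead of being dropped. The `needs_review` column and toy values are invented for illustration, not part of the data set.

```r
library(dplyr)

toy_hr <- tibble(id = c("a", "a", "b"), heart_rate = c(72, 208, 65))

# Mark values above the soft 200bpm threshold for manual review
toy_hr <- toy_hr %>% mutate(needs_review = heart_rate > 200)
```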

# For a given list of dfs, column names, and a min-max range, check all matching columns in all matching dfs against that range
validate_within_range <- function(df_name, column_names, range_min, range_max) {
  df <- get(df_name)
  for (col_name in column_names) {
    cat("Checking ranges for ",df_name,"[",col_name,"]... ",sep="")
    if (!(col_name %in% colnames(df))) {
      cat("column not found.\n")
    } else {
      out_of_range <- df %>%
        filter(!!sym(col_name) < range_min | !!sym(col_name) > range_max)
      if (nrow(out_of_range) <= 0) {
        cat("complete.\n",sep="")
      } else {
        cat("Found ",nrow(out_of_range)," out-of-range values:\n",sep="")
        glimpse(out_of_range)
      }
      rm(out_of_range)
    }
  }
  rm(df, col_name)
}

# Minute summation
valid_df_names <- c(
  "activity_sum_days_wide",
  "intensity_sum_days_wide",
  "sleep_sum_days_wide")
valid_col_names <- c(
  "minutes_very_active",
  "minutes_fairly_active",
  "minutes_lightly_active",
  "minutes_sedentary",
  "minutes_asleep_total",
  "minutes_in_bed_total")
range_max <- 60 * 12 # Minutes in half a day
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Weights (kg)
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("weight_kg")
range_max <- 150 # Arbitrarily chosen as high enough to be a potentially erroneous value
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Weights (pounds, same limit as for weight in kilos)
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("weight_pounds")
kg2lb <- 2.204623
range_max <- 150 * kg2lb # Arbitrarily chosen as high enough to be a potentially erroneous value
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
rm(kg2lb)

# BMI 
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("bmi")
range_max <- 40.0 # Lower threshold of the WHO "Obese (Class III)" weight category
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Distances
valid_df_names <- c(
  "activity_sum_days_wide",
  "intensity_sum_days_wide")
valid_col_names <- c(
  "distance_lightly_active",
  "distance_logged_activities",
  "distance_moderately_active",
  "distance_sedentary",
  "distance_total",
  "distance_tracker",
  "distance_very_active")
range_max <- 21.08241 # Equivalent to one half-marathon
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Step counts
valid_df_names <- c(
  "activity_sum_days_wide",
  "steps_sum_days_tall",
  "steps_sum_hours_tall",
  "steps_src_mins_tall")
valid_col_names <- c(
  "steps",
  "steps_total")
range_max <- 14800 # Set to double the average daily step count for Australian adults
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Calories
valid_df_names <- c(
  "activity_sum_days_wide",
  "calories_src_mins_tall",
  "calories_sum_days_tall",
  "calories_sum_hours_tall")
valid_col_names <- c("calories")
range_max <- 4000 # Chosen arbitrarily as double the typically-recommended daily caloric intake
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

# Heart Rates
valid_df_names <- c("heartrate_src_seconds_tall")
valid_col_names <- c("heart_rate")
range_max <- 200 # Chosen based on the average 100% heart-rate for a 20 y.o. (Source: American Heart Association)
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))

rm(range_max, result, valid_df_names, valid_col_names)

clean_apdx_labs <- grep("^clean-", knitr::all_labels(), value = TRUE)

Appendix C: “Analyse” Stage Code

# Declare generic functions for use in the analysis below

round_data_to_bin <- function(data, bin_width) {
  rounding_factor <- 1 / bin_width
  return(round(data * rounding_factor) / rounding_factor)
}
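A quick sanity check of the rounding behaviour (restating the function so the example runs standalone): values are snapped to the nearest multiple of `bin_width`.

```r
# Restated from above so this check is self-contained
round_data_to_bin <- function(data, bin_width) {
  rounding_factor <- 1 / bin_width
  round(data * rounding_factor) / rounding_factor
}

binned <- round_data_to_bin(c(0.12, 0.37, 0.88), 0.25)  # 0.00 0.25 1.00
```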

get_histogram_max_count <- function(data, bin_width) {
  # Round data to nearest multiple of bin_width, then count the tallest bin
  data_rounded <- round_data_to_bin(data, bin_width)
  max_count <- max(table(data_rounded))
  return(max_count)
}

rescale_plot <- function(p, x_min = 0, x_max = 10, x_step = 1, y_min = 0, y_max = 10, y_step = 1) {
  p <- p +
    scale_x_continuous(breaks = seq(x_min, x_max, by=x_step),
                       labels = scales::comma_format(),
                       limits=c(x_min, x_max)) +
    scale_y_continuous(breaks = seq(y_min, y_max, by=y_step),
                       labels = scales::comma_format(),
                       limits=c(y_min, y_max))
  return(p)
}

plot_histo_pareto <- function(data, bin_width) {
  # Cumulative data uses same bins as histogram
  data <- sort(data)
  data_rounded <- round_data_to_bin(data, bin_width)
  
  # Calculate highest count so axes can be adjusted
  highest_count <- get_histogram_max_count(data, bin_width)
  data_pareto <- seq(1, length(data), by = 1) / length(data) * highest_count

  # Calculate stats for overlay on histogram
  data_max <- max(data)
  data_mean <- round(mean(data), digits = 2)
  data_median <- round(median(data), digits = 2)
  sum_stats <- data.frame(Statistics = c("Mean", "Median"),
                          value = c(data_mean, data_median))
  label_text <- paste0("Mean = ", data_mean, ", Median = ", data_median)

  p <- ggplot() +
    geom_histogram(aes(x = data),
                   color = "white",
                   binwidth = bin_width) +
    geom_line(aes(x = data_rounded,
                  y = data_pareto),
                  color="forestgreen") +
    geom_vline(data = sum_stats,
               aes(xintercept = value,
                   linetype = Statistics,
                   color = Statistics),
               size = 1) +
    scale_x_continuous(breaks = seq(0, data_max + bin_width, by = bin_width)) +
    scale_y_continuous(name = "Users",
                       breaks = seq(0, highest_count, by = 1),
                       sec.axis = sec_axis(~./highest_count, name = "Cumulative Percentage of Users")) +
    # scale_linetype_manual(values = c("solid", "dashed", "solid"),
    #                       labels = c("Mean", "Median", "Cumulative")) +
    # scale_color_manual(name = "Legend",
    #                    values = c("red", "blue", "forestgreen")) +
    # annotate("text", x = Inf, y = Inf, label = label_text,
    #        hjust = 1, vjust = 1, size = 4) +
    labs(x = "Value",
         caption = label_text)
  return(p)
}

get_time_coefficients <- function(ids, timestamps, values) {
  tbl <- tibble(
    id = ids,
    timestamp = timestamps,
    value = values
  )

  coeffs <- tbl %>%
    mutate(day_of_year = yday(timestamp)) %>%
    group_by(id) %>%
    summarize(correlation = cor(day_of_year, value))

  return(coeffs)
}
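A toy check of the per-user correlation logic above (invented user and dates): a value that rises linearly with the day of year should give a coefficient of exactly 1.

```r
library(dplyr)
library(lubridate)

toy <- tibble(
  id = "user1",
  timestamp = as.Date("2016-04-01") + 0:9,  # ten consecutive days
  value = 1:10                              # strictly increasing
)

coeffs <- toy %>%
  mutate(day_of_year = yday(timestamp)) %>%
  group_by(id) %>%
  summarize(correlation = cor(day_of_year, value))
```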

plot_time_coefficients <- function(coeffs) {
  bin_width <- 0.1
  max_count <- get_histogram_max_count(coeffs$correlation, bin_width)
  
  coeffs_mean <- round(mean(coeffs$correlation, na.rm = TRUE), digits = 2)
  coeffs_median <- round(median(coeffs$correlation, na.rm = TRUE), digits = 2)
  sum_stats <- data.frame(Statistics = c("Mean", "Median"),
                          value = c(coeffs_mean, coeffs_median))
  label_text <- paste("Mean =", coeffs_mean, "\nMedian =", coeffs_median)
  
  ggplot(coeffs, aes(x = correlation)) +
    geom_histogram(binwidth = bin_width, color = "white") +
    geom_vline(data = sum_stats,
               aes(xintercept = value,
                   linetype = Statistics,
                   color = Statistics),
               size = 1) +
    scale_x_continuous(breaks = seq(-1.0, 1.0, by=bin_width)) +
    scale_y_continuous(breaks = seq(0, max_count, by=1)) +
    annotate("text", x = Inf, y = Inf, label = label_text,
           hjust = 1, vjust = 1, size = 4) +
    labs(title = "Correlation Coefficient with Time",
         caption = "Positive values imply variable of interest generally increased over time.",
         x = "Correlation Coefficient",
         y = "Count")
}

get_time_coefficients_plot <- function(ids, timestamps, values) {
  coeffs <- get_time_coefficients(ids, timestamps, values)
  p <- plot_time_coefficients(coeffs)
  return(p)
}

wide_to_stacked_bar_plot <- function(data_wide, key, value, key_order) {
  if(!("id" %in% colnames(data_wide))) {
    print("ERROR: data does not include \"id\" column: cannot convert.")
  } else {
    # Convert the data from wide to long. Set the factor levels to control the stacking order of the bars
    data_long <- data_wide %>%
      tidyr::gather(key = !!sym(key), value = !!sym(value), -id) %>%
      mutate(!!sym(key) := factor(!!sym(key), levels = key_order))

    # Order IDs in wide data based on value of first key, then rearrange long data.
    # This ensures the resultant plot sorts the IDs in ascending order of the first key
    first_key <- key_order[1]
    data_wide <- data_wide %>% arrange(!!sym(first_key))
    data_long$id <- factor(
      data_long$id,
      levels = data_wide$id)
    
    # Plot graph with angled ID labels
    p <- ggplot(data_long, aes(x= id, y= !!sym(value), fill= !!sym(key))) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
    # scale_y_continuous(breaks = seq(0, 24, by = 1))
  }
}

scatter_with_LOBF <- function(data_x, data_y, labels = NULL) {
  data <- tibble(
    x = data_x,
    y = data_y
  )
  
  p <- ggplot(data,
              aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = "lm",
                se = FALSE,
                color = "blue") +
    stat_cor(mapping = aes(label = after_stat(rr.label)),
             method="pearson",
             label.x=-Inf,
             label.y=Inf,
             hjust = -0.1,
             vjust = 1.1) +
    geom_text(aes(label = sprintf("Gradient: %.2f", coef(lm(y ~ x, data = data))[2])),
              x = -Inf,
              y= Inf,
              hjust = -0.05,
              vjust = 3.5)
  
  if(!is.null(labels) && length(labels) >= 2) {
    p <- p + labs(title = paste(labels[1], "vs.", labels[2]),
                  x = labels[1],
                  y = labels[2])
  }
  
  return(p)
}

# For each ID, generate averages for the three non-sedentary intensity levels
mean_daily_intensities_wide <- intensity_sum_days_wide %>%
  group_by(id) %>%
  summarize("sedentary" = mean(minutes_sedentary) / 60,
            "lightly_active" = mean(minutes_lightly_active) / 60,
            "fairly_active" = mean(minutes_fairly_active) / 60,
            "very_active" = mean(minutes_very_active) / 60)

#Convert data to long-format for plotting as a stacked-bar chart
intensity_order = c("sedentary", "lightly_active", "fairly_active", "very_active")
mean_daily_intensities_long <- mean_daily_intensities_wide %>%
  tidyr::gather(key = "intensity", value = "mean_hours", -id) %>%
  mutate(intensity = factor(intensity, levels = intensity_order))

# Reorder IDs in order of "Very Active" time
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
  mutate("total_active" = fairly_active + very_active) %>%
  arrange(total_active)

# Apply order of IDs to long-format data to force bar stacking order
mean_daily_intensities_long$id <- factor(
  mean_daily_intensities_long$id,
  levels = mean_daily_intensities_wide$id)
viz_avg_time_by_intensity <- ggplot(mean_daily_intensities_long, aes(x= id, y= mean_hours, fill= intensity)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 24, by = 1)) +
  labs(title="Average Daily Time Logged by Intensity Zone",
       x = "User ID",
       y = "Total Average Daily Time Logged (hours)")
print(viz_avg_time_by_intensity)
# Update total_active to reflect new definition
# mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
#   mutate(total_active = fairly_active + very_active) %>%
#   arrange(total_active)

viz_active_hrs_per_week <- plot_histo_pareto(mean_daily_intensities_wide$total_active, bin_width = 0.25) +
  labs(title = "Active Hours per Week",
       x = "Hours")
print(viz_active_hrs_per_week)

# For each ID, generate proportion Very Active time
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
  mutate(proportion_very_active = ifelse(total_active > 0,
                                     very_active / total_active,
                                     0))

# Visualise proportion Very Active time
viz_very_active_proportion <- plot_histo_pareto(mean_daily_intensities_wide$proportion_very_active,
                                                bin_width = 0.1) +
  labs(title = "Percentage of Active Time spent Very Active",
       x = "% Very Active")
print(viz_very_active_proportion)
mean_daily_intensities_wide_no_lower_quint <- mean_daily_intensities_wide %>%
  arrange(total_active) %>%
  head(ceiling(nrow(mean_daily_intensities_wide)*0.2))

mean_daily_intensities_wide_no_upper_quint <- mean_daily_intensities_wide %>%
  filter(total_active < 0.75)
viz_very_active_vs_total_active <- ggplot(mean_daily_intensities_wide, aes(x=total_active, y=proportion_very_active)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "blue") +
  stat_cor(aes(label = after_stat(rr.label)),
           label.x=Inf,
           hjust = 1,
           label.y= Inf,
           vjust = 1) +
  scale_x_continuous(limits=c(0,2)) + 
  scale_y_continuous(limits=c(0,1)) +
  labs(title="Proportion Very Active vs. Total Active Time",
       x = "Average Weekly Active Time (hours)",
       y = "Proportion of Very Active Time (%)")
print(viz_very_active_vs_total_active)
four_quintiles_length <- floor(nrow(mean_daily_intensities_wide)*0.8)

viz_very_active_vs_total_active_no_upper_quint <- ggplot(
  data = mean_daily_intensities_wide %>% 
    arrange(total_active) %>%
    head(four_quintiles_length),
  aes(x=total_active, y=proportion_very_active)
  ) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "blue") +
  stat_cor(aes(label = after_stat(rr.label)),
           label.x=Inf,
           hjust = 1,
           label.y= Inf,
           vjust = 1) +
  scale_x_continuous(limits=c(0,2)) + 
  scale_y_continuous(limits=c(0,1)) +
  labs(title="Proportion Very Active vs. Total Active Time",
       subtitle="Upper Quintile Removed",
       x = "Average Weekly Active Time (hours)",
       y = "Proportion of Very Active Time (%)")
print(viz_very_active_vs_total_active_no_upper_quint)

# For each intensity level, obtain time coefficients per ID

lightly_active_time_coeffs <- get_time_coefficients(
  activity_sum_days_wide$id,
  activity_sum_days_wide$activity_day,
  activity_sum_days_wide$minutes_lightly_active
  ) %>%
  mutate(intensity = "lightly_active")

fairly_active_time_coeffs <- get_time_coefficients(
  activity_sum_days_wide$id,
  activity_sum_days_wide$activity_day,
  activity_sum_days_wide$minutes_fairly_active
  ) %>%
  mutate(intensity = "fairly_active")

very_active_time_coeffs <- get_time_coefficients(
  activity_sum_days_wide$id,
  activity_sum_days_wide$activity_day,
  activity_sum_days_wide$minutes_very_active
  ) %>%
  mutate(intensity = "very_active")

# Bind coefficients into long-format data frame for visualisation
intensity_time_coeffs <- rbind(
  lightly_active_time_coeffs,
  fairly_active_time_coeffs,
  very_active_time_coeffs
)
viz_all_intensity_over_time <- ggplot(intensity_time_coeffs) +
  theme_minimal() +
  geom_freqpoly(aes(x = correlation,
                     color = intensity),
                 binwidth = 0.1,
                size = 1) +
  scale_x_continuous(breaks = seq(-1, 1, by = 0.1)) +
  scale_y_continuous(breaks = seq(0, 10, by = 1)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  labs(title = "Correlation Coefficients for Active Hours vs. Time",
       subtitle = "Grouped by Intensity",
       x = "Correlation with Time",
       y = "Users")
print(viz_all_intensity_over_time)
# For the Very Active proportion, obtain time coefficients per ID

daily_intensity_proportions <- intensity_sum_days_wide %>%
  mutate(minutes_active = minutes_fairly_active + minutes_very_active,
         proportion_very_active = ifelse(minutes_active > 0,
                                         minutes_very_active / minutes_active,
                                         0))

viz_very_active_proportion_coeffs <- get_time_coefficients_plot(
  daily_intensity_proportions$id,
  daily_intensity_proportions$activity_day,
  daily_intensity_proportions$proportion_very_active
) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  labs(title = "Correlation Coefficients for Very Active Proportion over Time",
       x = "Correlation with Time",
       y = "Users")

print(viz_very_active_proportion_coeffs)
brisk_walk_thld <- 80
running_thld <- 110

mean_movement_times_daily <- steps_src_mins_tall %>%
  mutate(activity_day = as.POSIXct(format(as.Date(activity_minute)))) %>%
  group_by(id, activity_day) %>%
  summarize(not_walking = sum(steps == 0) / 60,
            moderate_walking = sum(steps > 0 & steps <= brisk_walk_thld) / 60,
            brisk_walking = sum(steps > brisk_walk_thld & steps <= running_thld) / 60,
            running = sum(steps > running_thld) / 60)

mean_movement_times <- mean_movement_times_daily %>%
  group_by(id) %>%
  summarize(not_walking = mean(not_walking),
            moderate_walking = mean(moderate_walking),
            brisk_walking = mean(brisk_walking),
            running = mean(running)) %>%
  arrange(running)

speed_order = c("not_walking", "moderate_walking", "brisk_walking", "running")
mean_movement_times_long <- mean_movement_times %>%
  tidyr::gather(key = "speed", value = "mean_hours", -id) %>%
  mutate(speed = factor(speed, levels = speed_order))
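`tidyr::gather()` is superseded; the same wide-to-long reshape could be written with `pivot_longer()`. A toy frame mirroring the structure used here:

```r
library(dplyr)
library(tidyr)

toy_wide <- tibble(id = c("a", "b"),
                   not_walking = c(20.0, 22.0),
                   running = c(0.5, 1.0))

# Every column except id becomes a (speed, mean_hours) pair
toy_long <- toy_wide %>%
  pivot_longer(-id, names_to = "speed", values_to = "mean_hours")
```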

mean_movement_times_long$id <- factor(
  mean_movement_times_long$id,
  levels = mean_movement_times$id
)
ggplot(mean_movement_times_long,
       aes(x = id, y = mean_hours, fill = speed)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0,33,by=1)) +
  labs(title = "Average Daily Movement Speeds by User ID",
       x = "User ID",
       y = "Average Daily Hours")

viz_walking_speed_frequency <- ggplot(mean_movement_times_long %>% filter(speed != "not_walking")) +
  theme_pubclean() +
  geom_histogram(aes(x = mean_hours,
                     fill = speed),
                 binwidth = 0.25,
                 position = "dodge",
                 color = "black") +
  scale_x_continuous(breaks = seq(0, 5, by = 0.25)) +
  # theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Frequency of Average Walking Hours by Walking Speed",
       x = "Mean Hours",
       y = "Users")
print(viz_walking_speed_frequency)

# Calculate "Active Walking" times for each user
mean_movement_times <- mean_movement_times %>%
  mutate(total_active_walking = brisk_walking + running) %>%
  arrange(total_active_walking)

# Plot histogram "Active Walking" times for each user
viz_active_walking <- plot_histo_pareto(mean_movement_times$total_active_walking, bin_width = 0.25) +
  labs(title = "Average Daily Active-Walking Hours",
       x = "Average Daily Hours")
print(viz_active_walking)

# For each ID, generate proportion Running time
mean_movement_times <- mean_movement_times %>%
  mutate(proportion_running = ifelse(total_active_walking > 0,
                                     running / total_active_walking,
                                     0))
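The `ifelse()` guard above exists to avoid `NaN` for users with no active walking time, since dividing zero by zero in R yields `NaN` rather than zero. A standalone illustration with made-up scalar values:

```r
# With no active walking, a bare division produces NaN...
running <- 0
total_active_walking <- 0
running / total_active_walking  # NaN

# ...while the guarded version falls back to 0
guarded <- ifelse(total_active_walking > 0,
                  running / total_active_walking,
                  0)
guarded  # 0
```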

viz_running_proportion <- plot_histo_pareto(mean_movement_times$proportion_running,
                                            bin_width = 0.1) +
  labs(title = "Percentage of Active Walking Time Spent Running",
       x = "% Running")
print(viz_running_proportion)

# Calculate mean daily steps for each user
mean_daily_steps <- steps_sum_days_tall %>%
  group_by(id) %>%
  summarize(mean_steps = mean(steps_total))

# Compare active distance against active minutes for each user
distance_vs_active <- activity_sum_days_wide %>%
  select(id,
         activity_day,
         distance_moderately_active,
         distance_very_active,
         minutes_fairly_active,
         minutes_very_active) %>%
  mutate(distance_active = distance_moderately_active + distance_very_active,
         minutes_active = minutes_fairly_active + minutes_very_active) %>%
  group_by(id) %>%
  summarize(mean_distance_active = mean(distance_active),
            mean_minutes_active = mean(minutes_active))

# Analyse overnight usage using sleep log data
hours_asleep <- 6
total_nights <- 33
nights_logged_by_sleep_log <- sleep_sum_days_wide %>%
  group_by(id) %>%
  summarize(count_logged_asleep = sum(minutes_asleep_total / 60 > hours_asleep),
            count_logged_in_bed = sum(minutes_in_bed_total / 60 > hours_asleep),
            pct_logged_asleep = count_logged_asleep / total_nights,
            pct_logged_in_bed = count_logged_in_bed / total_nights)

plot_histo_pareto(nights_logged_by_sleep_log$pct_logged_in_bed, bin_width = 0.05) +
  labs(title = "Percentage of Nights Logged In Bed - Logged Users Only",
       subtitle = "Source: Sleep Log Data")

# Add missing ID values to the dataset to analyse the full cohort
missing_ids <- setdiff(unique(activity_sum_days_wide$id), nights_logged_by_sleep_log$id)
nights_logged_by_sleep_log <- rbind(
  nights_logged_by_sleep_log,
  data.frame(id = missing_ids,
             count_logged_asleep = 0,
             count_logged_in_bed = 0,
             pct_logged_asleep = 0.0,
             pct_logged_in_bed = 0.0)
)

# Visualise overnight usage using sleep log data
viz_night_usage_sleep_logs <- plot_histo_pareto(nights_logged_by_sleep_log$pct_logged_in_bed, bin_width = 0.05) +
  labs(title = "Percentage of Nights Logged In Bed - All Users",
       subtitle = "Source: Sleep Log Data")
print(viz_night_usage_sleep_logs)

# Analyse heartrate polling variance
heartrate_polling_variance <- heartrate_src_seconds_tall %>%
  group_by(id) %>%
  mutate(polling_diff = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
  filter(polling_diff < 60)

# Visualise heart rate polling variance
ggplot(heartrate_polling_variance) +
  geom_boxplot(aes(y = polling_diff)) +
  scale_y_continuous(breaks = seq(0, 60, by = 5)) +
  labs(title = "Variation in Heart Rate Polling",
       caption = "Distribution shown for polling differences less than 60 seconds",
       y = "Time Between Polls (seconds)")

# Analyse overnight usage using heart rate data

# Find nights where enough heart-rate data was logged to imply the user slept with their FitBit on
sec2hour <- 1 / 3600
# Sleep range set to between 10pm and 6am
sleep_range_stt_hour <- 22
sleep_range_end_hour <- 6
# Anything more than this time between logs is considered a removal (typical poll rate is 1-20 seconds)
max_poll_gap_secs <- 20
# Minimum six hours must be logged to be counted
min_hours_logged <- 6
nights_total <- 33

nights_logged_by_heart_rate <- heartrate_src_seconds_tall %>%
  # For logs in the AM, the night-of is set to the day before
  mutate(night_of_yday = ifelse(hour(heart_rate_second) >= sleep_range_stt_hour, yday(heart_rate_second),
                                ifelse(hour(heart_rate_second) < sleep_range_end_hour, yday(heart_rate_second) - 1,
                                       NA))) %>%
  # Only consider logs from the "sleep" range
  filter(!is.na(night_of_yday)) %>%
  # Group by id so that difftime only compares timestamps from the same user
  group_by(id) %>%
  mutate(time_since_last_poll = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
  # Remove any logs that are followed by too long a gap
  filter(time_since_last_poll <= max_poll_gap_secs) %>%
  group_by(id, night_of_yday) %>%
  # Find the total time logged as the sum of time between valid logs
  summarize(logged_time_hours = sum(time_since_last_poll) * sec2hour) %>%
  # Count up the nights where each user logged sufficient time to be considered asleep with their FitBit
  group_by(id) %>%
  summarize(nights_logged = sum(logged_time_hours > min_hours_logged),
            nights_logged_pct = nights_logged / nights_total)
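The gap-summing technique in the pipeline above can be illustrated on a handful of synthetic timestamps (a standalone sketch, not FitBit data; `polls` and `logged_secs` are made-up names): gaps between consecutive polls that fall within `max_poll_gap_secs` count toward logged time, while larger gaps are treated as the device being removed and contribute nothing.

```r
library(dplyr)
library(lubridate)

# Six synthetic polls: four at 10-second intervals, a 2-hour removal gap,
# then one more poll 10 seconds after the device is put back on
polls <- tibble(ts = ymd_hms("2016-04-12 23:00:00") +
                  c(0, 10, 20, 30, 7230, 7240))

logged_secs <- polls %>%
  mutate(gap = as.double(difftime(ts, lag(ts), units = "secs"))) %>%
  filter(gap <= 20) %>%   # drops the leading NA and the 7200-second gap
  summarize(total = sum(gap)) %>%
  pull(total)

logged_secs  # 40: three 10-second gaps before the removal plus one after
```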

# Plot histogram showing who logged what fraction of their nights
viz_night_usage_heartrate <- plot_histo_pareto(nights_logged_by_heart_rate$nights_logged_pct, bin_width = 0.05) +
  labs(title = "Percentage of Nights Logged In Bed - HR Users Only",
       subtitle = "Source: Heart Rate Data")
print(viz_night_usage_heartrate)

# Visualise sleep logs vs heart rate logs
sleep_data_merged <- merge(nights_logged_by_sleep_log, nights_logged_by_heart_rate, by = "id", all = TRUE) %>%
  select(id, pct_logged_asleep, nights_logged_pct) %>%
  rename(sleep_logs = pct_logged_asleep,
         heartrate_logs = nights_logged_pct)
sleep_data_merged <- tidyr::gather(sleep_data_merged, key = "Variable", value = "Value", -id)

viz_sleep_logs_vs_hr_logs <- ggplot(sleep_data_merged, aes(x = id, y = Value, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7, color = "black") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  labs(title = "Nights Logged: Sleep Logs vs. Heartrate Logs",
       x = "ID",
       y = "% of Nights Logged",
       fill = "Source")
print(viz_sleep_logs_vs_hr_logs)

# Analyse sleep quantity and quality
sleep_sum_quant_qual <- sleep_src_mins_tall %>%
  # Set the date for sleep logs in the AM to the day before, i.e. the Night Of [date]
  # (if_else is used because base ifelse would strip the Date class)
  mutate(sleep_date = if_else(hour(sleep_minute) > 12,
                              as.Date(sleep_minute),
                              as.Date(sleep_minute) - 1)) %>%
  group_by(id, sleep_date) %>%
  summarize(minutes_awake    = sum(sleep_rank == "Awake"),
            minutes_restless = sum(sleep_rank == "Restless"),
            minutes_asleep   = sum(sleep_rank == "Asleep"),
            minutes_total    = minutes_awake + minutes_restless + minutes_asleep,
            pct_awake        = minutes_awake / minutes_total,
            pct_restless     = minutes_restless / minutes_total,
            pct_asleep       = minutes_asleep / minutes_total)
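The `summarize()` step above counts minutes in each sleep rank by summing logicals, since `TRUE` coerces to 1 when summed. A quick standalone illustration with made-up values:

```r
# One row per logged minute, ranked by sleep state
sleep_rank <- c("Asleep", "Asleep", "Restless", "Awake", "Asleep")

sum(sleep_rank == "Asleep")    # 3: TRUEs coerce to 1 when summed
sum(sleep_rank == "Restless")  # 1
```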

sleep_sum_quant_qual_mean <- sleep_sum_quant_qual %>%
  group_by(id) %>%
  summarize(across(.cols = -sleep_date, \(x) mean(x, na.rm = TRUE), .names = "{.col}_mean")) %>%
  mutate(hours_total_mean = minutes_total_mean / 60)

# Visualise sleep quantity
plot_histo_pareto(sleep_sum_quant_qual_mean$hours_total_mean, bin_width = 1) +
  labs(title = "Average Total Hours per Sleep Log",
       x = "Hours Logged")

# Visualise sleep quality

plot_histo_pareto(sleep_sum_quant_qual_mean$pct_asleep_mean, bin_width = 0.05) +
  labs(title = "Percentage of Sleep Logs Spent Asleep",
       x = "% of Time")

# Brute-force visualisation of overall sleep quality over time
ggplot(sleep_sum_quant_qual,
       aes(x = sleep_date,
           y = pct_asleep)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "blue") +
  stat_cor(mapping = aes(label = ..rr.label..),
           method = "pearson",
           label.x = -Inf,
           label.y = Inf,
           hjust = -0.1,
           vjust = 1.1) +
  facet_wrap(vars(id))
# Preliminary findings: most trend lines are flat, with some (5-6) sloping slightly downward

# Visualise changes in sleep quantity over time
get_time_coefficients_plot(sleep_sum_quant_qual$id,
                           sleep_sum_quant_qual$sleep_date,
                           sleep_sum_quant_qual$minutes_asleep) +
  labs(title = "Correlation Coefficients for Average Total Time Asleep vs. Time")

# Visualise changes in sleep quality over time
get_time_coefficients_plot(sleep_sum_quant_qual$id,
                           sleep_sum_quant_qual$sleep_date,
                           sleep_sum_quant_qual$pct_asleep) +
  labs(title = "Correlation Coefficients for Average Proportion of Time Asleep vs. Time")

# Find periods (days or nights) where enough heart-rate data was logged to
# imply the user was wearing their FitBit

sec2hour <- 1 / 3600
# Period bounds: 6am-10pm for day logging (swap to 22 and 6 for night logging)
period_stt_hour <- 6
period_end_hour <- 22
# Anything more than this time between logs is considered a removal
# (typical poll rate is 1-20 seconds, increased to 60 to allow for the device
# moving out of position during daytime movement)
max_poll_gap_secs <- 60
# Minimum six hours must be logged to be counted
min_hours_logged <- 6
periods_total <- 33
# TRUE analyses daytime periods; FALSE analyses nights
logging_days <- TRUE

if (logging_days) {
  periods_logged <- heartrate_src_seconds_tall %>%
    mutate(yday = ifelse((hour(heart_rate_second) >= period_stt_hour & hour(heart_rate_second) < period_end_hour),
                         yday(heart_rate_second),
                         NA))
} else {
  # For night logging, set yday to the day before, i.e. "Night of [yday]"
  periods_logged <- heartrate_src_seconds_tall %>%
    mutate(yday = ifelse(hour(heart_rate_second) >= period_stt_hour,
                         yday(heart_rate_second),
                         ifelse(hour(heart_rate_second) < period_end_hour,
                                yday(heart_rate_second) - 1,
                                NA)))
}

periods_logged <- periods_logged %>%
  # Only consider logs within the period of interest
  filter(!is.na(yday)) %>%
  # Group by id so that difftime only compares timestamps from the same user
  group_by(id) %>%
  mutate(time_since_last_poll = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
  # Remove any logs that are followed by too long a gap
  filter(time_since_last_poll <= max_poll_gap_secs) %>%
  group_by(id, yday) %>%
  # Find the total time logged as the sum of time between valid logs
  summarize(logged_time_hours = sum(time_since_last_poll) * sec2hour) %>%
  # Count up the periods where each user logged sufficient time to be considered wearing their FitBit
  group_by(id) %>%
  summarize(periods = sum(logged_time_hours > min_hours_logged),
            periods_pct = periods / periods_total)

# Plot histogram showing who logged what fraction of their periods
plot_histo_pareto(periods_logged$periods_pct, bin_width = 0.05) +
  labs(title = "Percentage of Periods Logged",
       subtitle = "Source: Heart Rate Data")

# Analyse intensity levels

# From the Charge HR manual:
#   Sedentary = 0-50% Max HR
#   Lightly Active = 50-70% Max HR
#   Fairly Active = 70-85% Max HR
#   Very Active = 85-100% Max HR
# Max HR = 220 - age
# Arbitrarily set median age at 30 for initial analysis

assumed_median_age <- 30
max_heart_rate <- 220 - assumed_median_age
hr_zone_1 <- 0.5 * max_heart_rate
hr_zone_2 <- 0.7 * max_heart_rate
hr_zone_3 <- 0.85 * max_heart_rate
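As a quick worked check of these cut-offs (Max HR = 190 for the assumed age of 30, so the thresholds land at 95, 133 and 161.5 bpm), the zone logic can be expressed as a small standalone function. `classify_hr` is a hypothetical helper for illustration only, not part of the pipeline:

```r
# Hypothetical helper mirroring the zone cut-offs for an assumed age of 30
classify_hr <- function(hr, z1 = 95, z2 = 133, z3 = 161.5) {
  if (hr < z1) "sedentary"
  else if (hr < z2) "lightly_active"
  else if (hr < z3) "fairly_active"
  else "very_active"
}

classify_hr(90)   # "sedentary"
classify_hr(140)  # "fairly_active"
classify_hr(170)  # "very_active"
```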

hr_zone_data <- heartrate_src_seconds_tall %>%
  mutate(hr_zone = case_when(
    heart_rate >= 0 & heart_rate < hr_zone_1 ~ "sedentary",
    heart_rate >= hr_zone_1 & heart_rate < hr_zone_2 ~ "lightly_active",
    heart_rate >= hr_zone_2 & heart_rate < hr_zone_3 ~ "fairly_active",
    heart_rate >= hr_zone_3 ~ "very_active"
  )) %>%
  group_by(id) %>%
  summarize(sedentary = sum(hr_zone == "sedentary"),
            lightly_active = sum(hr_zone == "lightly_active"),
            fairly_active = sum(hr_zone == "fairly_active"),
            very_active = sum(hr_zone == "very_active"))

# Convert data to long format for plotting as a histogram
hr_zone_order <- c("sedentary", "lightly_active", "fairly_active", "very_active")
hr_zone_data_long <- hr_zone_data %>%
  tidyr::gather(key = "hr_zone", value = "count", -id) %>%
  mutate(hr_zone = factor(hr_zone, levels = hr_zone_order))

# Visualise intensity levels analysis
ggplot(hr_zone_data_long, aes(x = id, y = count, fill = hr_zone)) +
  geom_bar(stat = "identity",
           position = "fill") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Proportion of HR Logs by HR Zone",
       x = "User ID",
       y = "Proportion of Logs")

analyse_apdx_labs <- grep("^analyse-", knitr::all_labels(), value = TRUE)

Appendix D: “Share” Stage Code

print(viz_avg_time_by_intensity)
print(viz_active_hrs_per_week)
print(viz_very_active_proportion)
print(viz_very_active_vs_total_active)
print(viz_mean_daily_steps)
print(viz_walking_speed_frequency)
print(viz_all_intensity_over_time)
print(viz_mean_daily_steps_vs_time)
print(viz_all_walking_over_time)
print(viz_distance_vs_activity)
print(viz_night_usage_sleep_logs)
print(viz_night_usage_heartrate)
print(viz_sleep_logs_vs_hr_logs)
share_apdx_labs <- grep("^share-", knitr::all_labels(), value = TRUE)